Posted to dev@druid.apache.org by Eshcar Hillel <es...@verizonmedia.com.INVALID> on 2019/06/30 10:12:46 UTC

Theta sketch - concurrent union implementation

Hi Everyone,
As some of you may recall, a year ago we had a conversation on the mailing list about the synchronization of sketches:
https://lists.apache.org/thread.html/9899aa790a7eb561ab66f47b35c8f66ffe695432719251351339521a@%3Cdev.druid.apache.org%3E
The implementation of a concurrent theta sketch is now committed to the datasketches library. Details of the design and API can be found here:
https://datasketches.github.io/docs/Theta/ConcurrentThetaSketch.html
We would like to continue by implementing a concurrent union operation. For this I have opened an issue suggesting 3 design alternatives:
https://github.com/apache/incubator-datasketches-java/issues/263

With Druid being one of the main users of data sketches, and specifically of the union set operation, the input of the Druid community is valuable.

The advantage of a concurrent union implementation is that it is thread safe, namely it allows concurrent reads and updates of the union object. The application does not need to wrap the union implementation in a synchronized call, as is currently done in
https://github.com/apache/incubator-druid/blob/master/extensions-core/datasketches/src/main/java/org/apache/druid/query/aggregation/datasketches/theta/SketchAggregator.java

The core concept of a concurrent implementation is separating the object into local objects and a shared object, where the data flows from local to shared. The 3 design alternatives suggest different separations of read and write accesses:
1) write only to local (union), read only from shared (union)
2) write and read only from local (union)
3) write only to local (union), read only from shared (sketch)
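
To make the local/shared split concrete, here is a minimal sketch of the data flow in alternative 1, assuming a toy model in which exact sets stand in for theta sketches. It uses only the JDK and is not the datasketches API; the names LocalSharedUnionSketch, SharedUnion, LocalUnion and bufferLimit are hypothetical and chosen for illustration only.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy model only: exact sets stand in for theta sketches, and every
// class and method name here is hypothetical (not the datasketches API).
class LocalSharedUnionSketch {

    // Shared object: local buffers propagate into it, readers query it.
    static final class SharedUnion {
        private final Set<Long> items = ConcurrentHashMap.newKeySet();

        void mergeFrom(Set<Long> batch) {
            items.addAll(batch);
        }

        // Read path: sees only data that writers have already propagated.
        int distinctCount() {
            return items.size();
        }
    }

    // Local object: owned by one writer thread, so it needs no locking.
    static final class LocalUnion {
        private final SharedUnion shared;
        private final int bufferLimit;
        private final Set<Long> buffer = new HashSet<>();

        LocalUnion(SharedUnion shared, int bufferLimit) {
            this.shared = shared;
            this.bufferLimit = bufferLimit;
        }

        // Write path: touch only the local buffer, propagate when it fills up.
        void update(long item) {
            buffer.add(item);
            if (buffer.size() >= bufferLimit) {
                flush();
            }
        }

        // Data flows in one direction only: local to shared.
        void flush() {
            shared.mergeFrom(buffer);
            buffer.clear();
        }
    }

    // Starts the given number of writers, each feeding a disjoint range of
    // items, and returns the shared count once all writers have flushed.
    static int run(int writers, int itemsPerWriter) {
        SharedUnion shared = new SharedUnion();
        Thread[] threads = new Thread[writers];
        for (int w = 0; w < writers; w++) {
            final long base = (long) w * itemsPerWriter;
            threads[w] = new Thread(() -> {
                LocalUnion local = new LocalUnion(shared, 64);
                for (int i = 0; i < itemsPerWriter; i++) {
                    local.update(base + i);
                }
                local.flush();
            });
            threads[w].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return shared.distinctCount();
    }

    public static void main(String[] args) {
        System.out.println(run(4, 1000));
    }
}
```

In this toy model the application code contains no synchronized block: writers never contend on their local buffers, and a reader of the shared object may lag behind by whatever each writer has not yet flushed.
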
I would greatly appreciate it if you could give your feedback in the issue I opened (https://github.com/apache/incubator-datasketches-java/issues/263) so we can make the best decision (also) for Druid.
Thanks,
Eshcar

Re: Theta sketch - concurrent union implementation

Posted by Gian Merlino <gi...@apache.org>.
Hey Eshcar,

I see Himanshu wrote a note in
https://github.com/apache/incubator-datasketches-java/issues/263, and I
added a little bit of extra info as well. Hope it helps!

On Tue, Jul 2, 2019 at 7:06 AM Eshcar Hillel
<es...@verizonmedia.com.invalid> wrote:

> I did some thinking, and alternative 2 would not support a
> single-writer-multiple-readers scenario in Druid's incremental index,
> which is the common case. So this leaves a choice between alternatives 1
> and 3. Can anyone point out advantages of having a union API to answer
> queries rather than a sketch? The only reason I can think of is backward
> compatibility with the current implementation, but this might be a good
> enough reason.

Re: Theta sketch - concurrent union implementation

Posted by Eshcar Hillel <es...@verizonmedia.com.INVALID>.
I did some thinking, and alternative 2 would not support a single-writer-multiple-readers scenario in Druid's incremental index, which is the common case. So this leaves a choice between alternatives 1 and 3.
Can anyone point out advantages of having a union API to answer queries rather than a sketch? The only reason I can think of is backward compatibility with the current implementation, but this might be a good enough reason.
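
As an illustration of the single-writer-multiple-readers case, here is a minimal sketch of one possible reading of alternative 3, assuming the shared object is an immutable snapshot that the writer republishes periodically. Exact sets again stand in for theta sketches, only the JDK is used, and the names SingleWriterSnapshotSketch, publishEvery and demo are hypothetical; this is not how the datasketches library or Druid implements it.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

// Toy model only: a single writer owns a mutable local set and publishes
// immutable snapshots; any number of readers query the latest snapshot.
// All names are hypothetical and not part of the datasketches API.
class SingleWriterSnapshotSketch {
    private final AtomicReference<Set<Long>> published =
            new AtomicReference<>(Collections.emptySet());
    private final Set<Long> local = new HashSet<>(); // writer thread only
    private final int publishEvery;
    private int sinceLastPublish = 0;

    SingleWriterSnapshotSketch(int publishEvery) {
        this.publishEvery = publishEvery;
    }

    // Write path: called by the single writer thread only.
    void update(long item) {
        local.add(item);
        if (++sinceLastPublish >= publishEvery) {
            publish();
        }
    }

    // Copies the local state into a read-only snapshot and swaps it in.
    void publish() {
        published.set(Collections.unmodifiableSet(new HashSet<>(local)));
        sinceLastPublish = 0;
    }

    // Read path: safe from any thread, never touches the local set.
    int distinctCount() {
        return published.get().size();
    }

    // Runs one writer and one concurrent reader; returns the final count.
    static int demo(int items) {
        SingleWriterSnapshotSketch s = new SingleWriterSnapshotSketch(100);
        AtomicBoolean done = new AtomicBoolean(false);
        AtomicBoolean wentBackwards = new AtomicBoolean(false);
        Thread reader = new Thread(() -> {
            int last = 0;
            while (!done.get()) {
                int now = s.distinctCount();
                if (now < last) {
                    wentBackwards.set(true);
                }
                last = now;
            }
        });
        reader.start();
        for (long i = 0; i < items; i++) {
            s.update(i);
        }
        s.publish();
        done.set(true);
        try {
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        if (wentBackwards.get()) {
            throw new IllegalStateException("reader saw the count decrease");
        }
        return s.distinctCount();
    }

    public static void main(String[] args) {
        System.out.println(demo(1000));
    }
}
```

The design choice shown here is that readers only ever hold an immutable result, so answering a query never blocks the writer; the cost in this toy version is a full copy on every publish.
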