Posted to dev@beam.apache.org by Robin Qiu <ro...@google.com> on 2019/07/27 00:04:09 UTC

Re: [PROPOSAL] Integrate BigQuery-compatible HyperLogLog algorithm into Beam

Quick update: the PR implementing this feature has been sent out:
https://github.com/apache/beam/pull/9144. The design doc has also been
revamped to reflect the design decisions we have made.

On Tue, Jun 25, 2019 at 2:05 PM Robin Qiu <ro...@google.com> wrote:

>> Can you please add this to the design documents webpage.
>> https://beam.apache.org/contribute/design-documents/
>>
>
> Thanks for the reminder. Done! (https://github.com/apache/beam/pull/8947)
>
>
>> I am not sure if this feature should go into 'sdks/java/core' because
>> it seems a quite specific case, maybe it should go in the sketching
>> module so it can be easier to find,
>
>
> Adding it to a separate module under `extensions` sounds good to me.
>
>
>> or maybe in its own extension if
>> the 'mix' of dependencies may be an issue and then make this
>> dependency a requirement for the gcp module since I suppose the
>> ultimate goal is to integrate it there.
>>
>
> I guess we can shade dependencies of ZetaSketch if it creates a problem
> when integrated with Beam. But I would not relate it to a gcp module since
> I think it will be a useful feature regardless of whether users run it on
> GCP or not (although if run on GCP, it will get better integration with
> BigQuery).
>
> On Mon, Jun 24, 2019 at 1:55 PM Ismaël Mejía <ie...@gmail.com> wrote:
>
>> Thanks for bringing this up, Robin,
>>
>> Can you please add this to the design documents webpage.
>> https://beam.apache.org/contribute/design-documents/
>>
>> I left some comments in the doc. It is great that this is finally open
>> and even better that it becomes part of Beam.
>>
>> I am not sure if this feature should go into 'sdks/java/core' because
>> it seems a quite specific case, maybe it should go in the sketching
>> module so it can be easier to find, or maybe in its own extension if
>> the 'mix' of dependencies may be an issue and then make this
>> dependency a requirement for the gcp module since I suppose the
>> ultimate goal is to integrate it there.
>>
>> CC +arnaudfournier921@gmail.com, original author of the sketching
>> library, who may be interested in this.
>>
>>
>> On Mon, Jun 24, 2019 at 9:31 PM Rui Wang <ru...@google.com> wrote:
>> >
>> > Thanks Robin! It would also be interesting if we could offer HLL_COUNT
>> > functions in BeamSQL based on your proposal!
>> >
>> >
>> > -Rui
>> >
>> > On Mon, Jun 24, 2019 at 10:47 AM Robin Qiu <ro...@google.com> wrote:
>> >>
>> >> Hi all,
>> >>
>> >> I have written a doc proposing we integrate the HyperLogLog++
>> >> algorithm into Beam as a new combiner. The algorithm solves the
>> >> count-distinct problem, and the intermediate sketch (aggregator) format
>> >> will be compatible with sketches computed via the HLL_COUNT functions in
>> >> Google Cloud BigQuery (because they will be based on the same
>> >> implementation: ZetaSketch). The tracking JIRA issue is BEAM-7013.
>> >>
>> >> The API design proposed in the doc is subject to change and open to
>> >> comments. Please feel free to comment if you have any thoughts.
>> >>
>> >> Cheers,
>> >> Robin
>>
>