You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Reza Rokni <re...@google.com> on 2019/11/13 02:35:21 UTC

Cleaning up Approximate Algorithms in Beam

Hi everyone;

TL/DR : Discussion on Beam's various Approximate Distinct Count algorithms.

Today there are several options for Approximate Algorithms in Apache Beam
2.16 with HLLCount being the most recently added. Would like to canvas
opinions here on the possibility of rationalizing these API's by removing
obsolete / less efficient implementations.
The current situation:

There are three options available to users: ApproximateUnique.java
<https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>,
ApproximateDistinct.java
<https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
and HllCount.java
<https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>.
A quick summary of these API's as I understand them:

HllCount.java
<https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>:
Marked as @Experimental

PTransforms to compute HyperLogLogPlusPlus (HLL++) sketches on data streams
based on the ZetaSketch <https://github.com/google/zetasketch>
implementation.Detailed design of this class, see
https://s.apache.org/hll-in-beam.

ApproximateUnique.java
<https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>:
Not Marked with experimental

This API does not expose the ability to create sketches so it's not
suitable for the OLAP use case that HLL++ is geared towards (do
pre-aggregation into sketches to allow interactive query speeds). It's also
less precise for the same amount of memory used: the error bounds in the
doc comments give :

/* The error is about

{@code 2 * / sqrt(sampleSize)},) */

Compared to the default HLLCount sketch size, its error is 10X larger than
the HLL++ error.

ApproximateDistinct.java
<https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
Marked with @Experimental

This is a re-implementation of the HLL++ algorithm, based on the paper
published in 2013. It is exposing sketches via a HyperLogLogPlusCoder. We
have not run any benchmarks to compare this implementation compared to the
HLLCount and we need to be careful to ensure that if we were to change any
of these API's that the binary format of the sketches should never change,
there could be users who have stored previous sketches using
ApproximateDistinct and it will be important to try and ensure they do not
have a bad experience.


Proposals:

There are two classes of users expected for these algorithms:

1) Users who simply use the transform to estimate the size of their data
set in Beam

2) Users who want to create sketches and store them, either for
interoperability with other systems, or as features to be used in further
data processing.



For use case 1, it is possible to make use of naming which does not expose
the implementation, however for use case 2 it is important for the
implementation to be explicit as sketches produced with one implementation
will not work with other implementations.

ApproximateUnique.java
<https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>
:

This one does not provide sketches and based on the notes above, is not as
efficient as HLLCount. However it does have a very searchable name and is
likely to be the one users will gravitate to when searching for Approximate
unique algorithms but do not need the capabilities of sketches.

Ideally we should think about switching the implementation of this
transform to wrap HLLCount. However this could mean changing things in a
way which is not transparent to the end developer.  Although as a result of
the change they would get a better implementation for free on an upgrade :-)

Another option would be to mark this transform as @Deprecated and create a
new transform ApproximateCountDistinct which would wrap HLLCount. The name
is also much clearer.

ApproximateDistinct.java
<https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>

This transform does generate sketches as output and given its marked as
@Experimental, one option we would have is to create a name which includes
the algorithm implementation details, for example
ApproximateCountDistinctClearSpring.



HllCount.java
<https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>
.

Again we have a few options here, as the name does not include search words
like Approximate, we can create a wrapper which has a name similar to
ApproximateCountDistinctZetaSketch.



Thoughts?



Cheers



Reza





-- 

This email may be confidential and privileged. If you received this
communication by mistake, please don't forward it to anyone else, please
erase all copies and attachments, and please let me know that it has gone
to the wrong person.

The above terms reflect a potential business arrangement, are provided
solely as a basis for further discussion, and are not intended to be and do
not constitute a legally binding obligation. No legally binding obligations
will be created, implied, or inferred until an agreement in final form is
executed in writing by all parties involved.

Re: Cleaning up Approximate Algorithms in Beam

Posted by Reza Rokni <re...@google.com>.
Hi,

Sorry it took almost a year before we found time...

https://github.com/apache/beam/pull/12973 ( Robin and *Andrea  have agreed
to review). *

With this PR the old ApproximateUnique will be marked as deprecated. With
notes to make use of ApproximateCountDistinct.java
<https://github.com/apache/beam/pull/12973/files#diff-8b3de19b55328b9ca265ec25d57c86e2c59a9c505195146df4af5f9eb73e5fc8>
.

Thanx

Reza

On Wed, Nov 27, 2019 at 2:29 AM Robert Bradshaw <ro...@google.com> wrote:

> I think this thread is sufficient.
>
> On Mon, Nov 25, 2019 at 5:59 PM Reza Rokni <re...@google.com> wrote:
>
>> Hi,
>>
>> So do we need a vote for the final list of actions? Or is this thread
>> enough to go ahead and raise the PR's?
>>
>> Cheers
>>
>> Reza
>>
>> On Tue, 26 Nov 2019 at 06:01, Ahmet Altay <al...@google.com> wrote:
>>
>>>
>>>
>>> On Mon, Nov 18, 2019 at 10:57 AM Robert Bradshaw <ro...@google.com>
>>> wrote:
>>>
>>>> On Sun, Nov 17, 2019 at 5:16 PM Reza Rokni <re...@google.com> wrote:
>>>>
>>>>> *Ahmet: FWIW, There is a python implementation only for this
>>>>> version: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38
>>>>> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38> *
>>>>> Eventually we will be able to make use of cross language transforms to
>>>>> help with feature parity. Until then, are we ok with marking this
>>>>> deprecated in python, even though we do not have another solution. Or leave
>>>>> it as is in Python now, as it does not have sketch capability so can only
>>>>> be used for outputting results directly from the pipeline.
>>>>>
>>>>
>>> If it is our intention to add the capability eventually, IMO it makes
>>> sense to mark the existing functionality deprecated in Python as well.
>>>
>>>
>>>>> *Reuven: I think this is the sort of thing that has been experimental
>>>>> forever, and therefore not experimental (e.g. the entire triggering API is
>>>>> experimental as are all our file-based sinks). I think that many users use
>>>>> this, and probably store the state implicitly in streaming pipelines.*
>>>>> True, I have an old action item to try and go through and PR against
>>>>> old @experimental annotations but need to find time. So for this
>>>>> discussion; I guess this should be marked as deprecated if we change it
>>>>> even though its @experimental.
>>>>>
>>>>
>>>> Agreed.
>>>>
>>>>
>>>>> *Rob: I'm not following this--by naming things after their
>>>>> implementation rather than their intent I think they will be harder to
>>>>> search for. *
>>>>> This is to add to the name the implementation, after the intent. For
>>>>> example ApproximateCountDistinctZetaSketch, I believe should be easy to
>>>>> search for and it is clear which implementation is used. Allowing for a
>>>>> potentially better implementation ApproximateCountDistinct<FooBar>.
>>>>>
>>>>
>>>> OK, if we have both I'm more OK with that. This is better than the
>>>> names like HllCount, which seems to be what was suggested.
>>>>
>>>> Another approach would be to have a required  parameter which is an
>>>>> enum of the implementation options.
>>>>> ApproximateCountDistinct.of().usingImpl(ZETA) ?
>>>>>
>>>>
>>>> Ideally this could be an optional parameter, or possibly only required
>>>> during update until we figure out a good way for the runner to plug this in
>>>> appropreately.
>>>>
>>>> Rob/Kenn: On Combiner discussion, should we tie action items from the
>>>>> needs of this thread to this larger discussion?
>>>>>
>>>>> Cheers
>>>>> Reza
>>>>>
>>>>> On Fri, 15 Nov 2019 at 08:32, Robert Bradshaw <ro...@google.com>
>>>>> wrote:
>>>>>
>>>>>> On Thu, Nov 14, 2019 at 1:06 AM Kenneth Knowles <ke...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Wow. Nice summary, yes. Major calls to action:
>>>>>>>
>>>>>>> 0. Never allow a combiner that does not include the format of its
>>>>>>> state clear in its name/URN. The "update compatibility" problem makes their
>>>>>>> internal accumulator state essentially part of their public API. Combiners
>>>>>>> named for what they do are an inherent risk, since we might have a new way
>>>>>>> to do the same operation with different implementation-detail state.
>>>>>>>
>>>>>>
>>>>>> It seems this will make for a worse user experience, motivated solely
>>>>>> by limitations in our implementation. I think we can do better.
>>>>>> Hypothetical idea: what if upgrade required access to the original graph
>>>>>> (or at least metadata about it) during construction? In this case an
>>>>>> ApproximateDistinct could look at what was used last time and try to do the
>>>>>> same, but be free to do something better when unconstrained. Another
>>>>>> approach would be to encode several alternative expansions in the Beam
>>>>>> graph and let the runner do the picking (based on prior submission).
>>>>>> (Making the CombineFn, as opposed to the composite, have several
>>>>>> alternatives seems harder to reason about, but maybe worth pursuing as
>>>>>> well).
>>>>>>
>>>>>> This is not unique to Combiners, but any stateful DoFn, or composite
>>>>>> operations with non-trivial internal structure (and coders). This has been
>>>>>> discussed a lot, perhaps there are some ideas there we could borrow?
>>>>>>
>>>>>> And they will match search terms better, which is a major problem.
>>>>>>>
>>>>>>
>>>>>> I'm not following this--by naming things after their implementation
>>>>>> rather than their intent I think they will be harder to search for.
>>>>>>
>>>>>>
>>>>>>> 1. Point users to HllCount. This seems to be the best of the three.
>>>>>>> Does it have a name that is clear enough about the format of its state?
>>>>>>> Noting that its Java package name includes zetasketch, perhaps.
>>>>>>> 2. Deprecate the others, at least. And remove them from e.g. Javadoc.
>>>>>>>
>>>>>>
>>>>>> +1
>>>>>>
>>>>>>
>>>>>>> On Wed, Nov 13, 2019 at 10:01 AM Reuven Lax <re...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Nov 13, 2019 at 9:58 AM Ahmet Altay <al...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thank you for writing this summary.
>>>>>>>>>
>>>>>>>>> On Tue, Nov 12, 2019 at 6:35 PM Reza Rokni <re...@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi everyone;
>>>>>>>>>>
>>>>>>>>>> TL/DR : Discussion on Beam's various Approximate Distinct Count
>>>>>>>>>> algorithms.
>>>>>>>>>>
>>>>>>>>>> Today there are several options for Approximate Algorithms in
>>>>>>>>>> Apache Beam 2.16 with HLLCount being the most recently added. Would like to
>>>>>>>>>> canvas opinions here on the possibility of rationalizing these API's by
>>>>>>>>>> removing obsolete / less efficient implementations.
>>>>>>>>>> The current situation:
>>>>>>>>>>
>>>>>>>>>> There are three options available to users:
>>>>>>>>>> ApproximateUnique.java
>>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>,
>>>>>>>>>> ApproximateDistinct.java
>>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>>>>>> and HllCount.java
>>>>>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>.
>>>>>>>>>> A quick summary of these API's as I understand them:
>>>>>>>>>>
>>>>>>>>>> HllCount.java
>>>>>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>:
>>>>>>>>>> Marked as @Experimental
>>>>>>>>>>
>>>>>>>>>> PTransforms to compute HyperLogLogPlusPlus (HLL++) sketches on
>>>>>>>>>> data streams based on the ZetaSketch
>>>>>>>>>> <https://github.com/google/zetasketch> implementation.Detailed
>>>>>>>>>> design of this class, see https://s.apache.org/hll-in-beam.
>>>>>>>>>>
>>>>>>>>>> ApproximateUnique.java
>>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>:
>>>>>>>>>> Not Marked with experimental
>>>>>>>>>>
>>>>>>>>>> This API does not expose the ability to create sketches so it's
>>>>>>>>>> not suitable for the OLAP use case that HLL++ is geared towards (do
>>>>>>>>>> pre-aggregation into sketches to allow interactive query speeds). It's also
>>>>>>>>>> less precise for the same amount of memory used: the error bounds in the
>>>>>>>>>> doc comments give :
>>>>>>>>>>
>>>>>>>>>> /* The error is about
>>>>>>>>>>
>>>>>>>>>> {@code 2 * / sqrt(sampleSize)},) */
>>>>>>>>>>
>>>>>>>>>> Compared to the default HLLCount sketch size, its error is 10X
>>>>>>>>>> larger than the HLL++ error.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> FWIW, There is a python implementation only for this version:
>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> ApproximateDistinct.java
>>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>>>>>> Marked with @Experimental
>>>>>>>>>>
>>>>>>>>>> This is a re-implementation of the HLL++ algorithm, based on the
>>>>>>>>>> paper published in 2013. It is exposing sketches via a
>>>>>>>>>> HyperLogLogPlusCoder. We have not run any benchmarks to compare this
>>>>>>>>>> implementation compared to the HLLCount and we need to be careful to ensure
>>>>>>>>>> that if we were to change any of these API's that the binary format of the
>>>>>>>>>> sketches should never change, there could be users who have stored previous
>>>>>>>>>> sketches using ApproximateDistinct and it will be important to try and
>>>>>>>>>> ensure they do not have a bad experience.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Proposals:
>>>>>>>>>>
>>>>>>>>>> There are two classes of users expected for these algorithms:
>>>>>>>>>>
>>>>>>>>>> 1) Users who simply use the transform to estimate the size of
>>>>>>>>>> their data set in Beam
>>>>>>>>>>
>>>>>>>>>> 2) Users who want to create sketches and store them, either for
>>>>>>>>>> interoperability with other systems, or as features to be used in further
>>>>>>>>>> data processing.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> For use case 1, it is possible to make use of naming which does
>>>>>>>>>> not expose the implementation, however for use case 2 it is important for
>>>>>>>>>> the implementation to be explicit as sketches produced with one
>>>>>>>>>> implementation will not work with other implementations.
>>>>>>>>>>
>>>>>>>>>> ApproximateUnique.java
>>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>
>>>>>>>>>> :
>>>>>>>>>>
>>>>>>>>>> This one does not provide sketches and based on the notes above,
>>>>>>>>>> is not as efficient as HLLCount. However it does have a very searchable
>>>>>>>>>> name and is likely to be the one users will gravitate to when searching for
>>>>>>>>>> Approximate unique algorithms but do not need the capabilities of sketches.
>>>>>>>>>>
>>>>>>>>>> Ideally we should think about switching the implementation of
>>>>>>>>>> this transform to wrap HLLCount. However this could mean changing things in
>>>>>>>>>> a way which is not transparent to the end developer.  Although as a result
>>>>>>>>>> of the change they would get a better implementation for free on an upgrade
>>>>>>>>>> :-)
>>>>>>>>>>
>>>>>>>>>> Another option would be to mark this transform as @Deprecated and
>>>>>>>>>> create a new transform ApproximateCountDistinct which would wrap
>>>>>>>>>> HLLCount. The name is also much clearer.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Marking it deprecated instead of changing its implementation will
>>>>>>>>> probably create a less surprising experience for the users.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ApproximateDistinct.java
>>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>>>>>>
>>>>>>>>>> This transform does generate sketches as output and given its
>>>>>>>>>> marked as @Experimental, one option we would have is to create a name which
>>>>>>>>>> includes the algorithm implementation details, for example
>>>>>>>>>> ApproximateCountDistinctClearSpring.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Can we remove this, if it is both experimental and providing the
>>>>>>>>> same utility as HllCount? Is the concern that users might have stored
>>>>>>>>> sketches?
>>>>>>>>>
>>>>>>>>
>>>>>>>> I think this is the sort of thing that has been experimental
>>>>>>>> forever, and therefore not experimental (e.g. the entire triggering API is
>>>>>>>> experimental as are all our file-based sinks). I think that many users use
>>>>>>>> this, and probably store the state implicitly in streaming pipelines.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> HllCount.java
>>>>>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>
>>>>>>>>>> .
>>>>>>>>>>
>>>>>>>>>> Again we have a few options here, as the name does not include
>>>>>>>>>> search words like Approximate, we can create a wrapper which has a name
>>>>>>>>>> similar to ApproximateCountDistinctZetaSketch.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thoughts?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Reza
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>>> This email may be confidential and privileged. If you received
>>>>>>>>>> this communication by mistake, please don't forward it to anyone else,
>>>>>>>>>> please erase all copies and attachments, and please let me know that it has
>>>>>>>>>> gone to the wrong person.
>>>>>>>>>>
>>>>>>>>>> The above terms reflect a potential business arrangement, are
>>>>>>>>>> provided solely as a basis for further discussion, and are not intended to
>>>>>>>>>> be and do not constitute a legally binding obligation. No legally binding
>>>>>>>>>> obligations will be created, implied, or inferred until an agreement in
>>>>>>>>>> final form is executed in writing by all parties involved.
>>>>>>>>>>
>>>>>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> This email may be confidential and privileged. If you received this
>>>>> communication by mistake, please don't forward it to anyone else, please
>>>>> erase all copies and attachments, and please let me know that it has gone
>>>>> to the wrong person.
>>>>>
>>>>> The above terms reflect a potential business arrangement, are provided
>>>>> solely as a basis for further discussion, and are not intended to be and do
>>>>> not constitute a legally binding obligation. No legally binding obligations
>>>>> will be created, implied, or inferred until an agreement in final form is
>>>>> executed in writing by all parties involved.
>>>>>
>>>>
>>
>> --
>>
>> This email may be confidential and privileged. If you received this
>> communication by mistake, please don't forward it to anyone else, please
>> erase all copies and attachments, and please let me know that it has gone
>> to the wrong person.
>>
>> The above terms reflect a potential business arrangement, are provided
>> solely as a basis for further discussion, and are not intended to be and do
>> not constitute a legally binding obligation. No legally binding obligations
>> will be created, implied, or inferred until an agreement in final form is
>> executed in writing by all parties involved.
>>
>

Re: Cleaning up Approximate Algorithms in Beam

Posted by Reza Rokni <re...@google.com>.
Hi,

I will be making time for completing this in March, with completion and
reviews planned for April.

Cheers

Reza

On Wed, 27 Nov 2019 at 02:29, Robert Bradshaw <ro...@google.com> wrote:

> I think this thread is sufficient.
>
> On Mon, Nov 25, 2019 at 5:59 PM Reza Rokni <re...@google.com> wrote:
>
>> Hi,
>>
>> So do we need a vote for the final list of actions? Or is this thread
>> enough to go ahead and raise the PR's?
>>
>> Cheers
>>
>> Reza
>>
>> On Tue, 26 Nov 2019 at 06:01, Ahmet Altay <al...@google.com> wrote:
>>
>>>
>>>
>>> On Mon, Nov 18, 2019 at 10:57 AM Robert Bradshaw <ro...@google.com>
>>> wrote:
>>>
>>>> On Sun, Nov 17, 2019 at 5:16 PM Reza Rokni <re...@google.com> wrote:
>>>>
>>>>> *Ahmet: FWIW, There is a python implementation only for this
>>>>> version: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38
>>>>> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38> *
>>>>> Eventually we will be able to make use of cross language transforms to
>>>>> help with feature parity. Until then, are we ok with marking this
>>>>> deprecated in python, even though we do not have another solution. Or leave
>>>>> it as is in Python now, as it does not have sketch capability so can only
>>>>> be used for outputting results directly from the pipeline.
>>>>>
>>>>
>>> If it is our intention to add the capability eventually, IMO it makes
>>> sense to mark the existing functionality deprecated in Python as well.
>>>
>>>
>>>>> *Reuven: I think this is the sort of thing that has been experimental
>>>>> forever, and therefore not experimental (e.g. the entire triggering API is
>>>>> experimental as are all our file-based sinks). I think that many users use
>>>>> this, and probably store the state implicitly in streaming pipelines.*
>>>>> True, I have an old action item to try and go through and PR against
>>>>> old @experimental annotations but need to find time. So for this
>>>>> discussion; I guess this should be marked as deprecated if we change it
>>>>> even though its @experimental.
>>>>>
>>>>
>>>> Agreed.
>>>>
>>>>
>>>>> *Rob: I'm not following this--by naming things after their
>>>>> implementation rather than their intent I think they will be harder to
>>>>> search for. *
>>>>> This is to add to the name the implementation, after the intent. For
>>>>> example ApproximateCountDistinctZetaSketch, I believe should be easy to
>>>>> search for and it is clear which implementation is used. Allowing for a
>>>>> potentially better implementation ApproximateCountDistinct<FooBar>.
>>>>>
>>>>
>>>> OK, if we have both I'm more OK with that. This is better than the
>>>> names like HllCount, which seems to be what was suggested.
>>>>
>>>> Another approach would be to have a required  parameter which is an
>>>>> enum of the implementation options.
>>>>> ApproximateCountDistinct.of().usingImpl(ZETA) ?
>>>>>
>>>>
>>>> Ideally this could be an optional parameter, or possibly only required
>>>> during update until we figure out a good way for the runner to plug this in
>>>> appropreately.
>>>>
>>>> Rob/Kenn: On Combiner discussion, should we tie action items from the
>>>>> needs of this thread to this larger discussion?
>>>>>
>>>>> Cheers
>>>>> Reza
>>>>>
>>>>> On Fri, 15 Nov 2019 at 08:32, Robert Bradshaw <ro...@google.com>
>>>>> wrote:
>>>>>
>>>>>> On Thu, Nov 14, 2019 at 1:06 AM Kenneth Knowles <ke...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Wow. Nice summary, yes. Major calls to action:
>>>>>>>
>>>>>>> 0. Never allow a combiner that does not include the format of its
>>>>>>> state clear in its name/URN. The "update compatibility" problem makes their
>>>>>>> internal accumulator state essentially part of their public API. Combiners
>>>>>>> named for what they do are an inherent risk, since we might have a new way
>>>>>>> to do the same operation with different implementation-detail state.
>>>>>>>
>>>>>>
>>>>>> It seems this will make for a worse user experience, motivated solely
>>>>>> by limitations in our implementation. I think we can do better.
>>>>>> Hypothetical idea: what if upgrade required access to the original graph
>>>>>> (or at least metadata about it) during construction? In this case an
>>>>>> ApproximateDistinct could look at what was used last time and try to do the
>>>>>> same, but be free to do something better when unconstrained. Another
>>>>>> approach would be to encode several alternative expansions in the Beam
>>>>>> graph and let the runner do the picking (based on prior submission).
>>>>>> (Making the CombineFn, as opposed to the composite, have several
>>>>>> alternatives seems harder to reason about, but maybe worth pursuing as
>>>>>> well).
>>>>>>
>>>>>> This is not unique to Combiners, but any stateful DoFn, or composite
>>>>>> operations with non-trivial internal structure (and coders). This has been
>>>>>> discussed a lot, perhaps there are some ideas there we could borrow?
>>>>>>
>>>>>> And they will match search terms better, which is a major problem.
>>>>>>>
>>>>>>
>>>>>> I'm not following this--by naming things after their implementation
>>>>>> rather than their intent I think they will be harder to search for.
>>>>>>
>>>>>>
>>>>>>> 1. Point users to HllCount. This seems to be the best of the three.
>>>>>>> Does it have a name that is clear enough about the format of its state?
>>>>>>> Noting that its Java package name includes zetasketch, perhaps.
>>>>>>> 2. Deprecate the others, at least. And remove them from e.g. Javadoc.
>>>>>>>
>>>>>>
>>>>>> +1
>>>>>>
>>>>>>
>>>>>>> On Wed, Nov 13, 2019 at 10:01 AM Reuven Lax <re...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Nov 13, 2019 at 9:58 AM Ahmet Altay <al...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thank you for writing this summary.
>>>>>>>>>
>>>>>>>>> On Tue, Nov 12, 2019 at 6:35 PM Reza Rokni <re...@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi everyone;
>>>>>>>>>>
>>>>>>>>>> TL/DR : Discussion on Beam's various Approximate Distinct Count
>>>>>>>>>> algorithms.
>>>>>>>>>>
>>>>>>>>>> Today there are several options for Approximate Algorithms in
>>>>>>>>>> Apache Beam 2.16 with HLLCount being the most recently added. Would like to
>>>>>>>>>> canvas opinions here on the possibility of rationalizing these API's by
>>>>>>>>>> removing obsolete / less efficient implementations.
>>>>>>>>>> The current situation:
>>>>>>>>>>
>>>>>>>>>> There are three options available to users:
>>>>>>>>>> ApproximateUnique.java
>>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>,
>>>>>>>>>> ApproximateDistinct.java
>>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>>>>>> and HllCount.java
>>>>>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>.
>>>>>>>>>> A quick summary of these API's as I understand them:
>>>>>>>>>>
>>>>>>>>>> HllCount.java
>>>>>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>:
>>>>>>>>>> Marked as @Experimental
>>>>>>>>>>
>>>>>>>>>> PTransforms to compute HyperLogLogPlusPlus (HLL++) sketches on
>>>>>>>>>> data streams based on the ZetaSketch
>>>>>>>>>> <https://github.com/google/zetasketch> implementation.Detailed
>>>>>>>>>> design of this class, see https://s.apache.org/hll-in-beam.
>>>>>>>>>>
>>>>>>>>>> ApproximateUnique.java
>>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>:
>>>>>>>>>> Not Marked with experimental
>>>>>>>>>>
>>>>>>>>>> This API does not expose the ability to create sketches so it's
>>>>>>>>>> not suitable for the OLAP use case that HLL++ is geared towards (do
>>>>>>>>>> pre-aggregation into sketches to allow interactive query speeds). It's also
>>>>>>>>>> less precise for the same amount of memory used: the error bounds in the
>>>>>>>>>> doc comments give :
>>>>>>>>>>
>>>>>>>>>> /* The error is about
>>>>>>>>>>
>>>>>>>>>> {@code 2 * / sqrt(sampleSize)},) */
>>>>>>>>>>
>>>>>>>>>> Compared to the default HLLCount sketch size, its error is 10X
>>>>>>>>>> larger than the HLL++ error.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> FWIW, There is a python implementation only for this version:
>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> ApproximateDistinct.java
>>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>>>>>> Marked with @Experimental
>>>>>>>>>>
>>>>>>>>>> This is a re-implementation of the HLL++ algorithm, based on the
>>>>>>>>>> paper published in 2013. It is exposing sketches via a
>>>>>>>>>> HyperLogLogPlusCoder. We have not run any benchmarks to compare this
>>>>>>>>>> implementation compared to the HLLCount and we need to be careful to ensure
>>>>>>>>>> that if we were to change any of these API's that the binary format of the
>>>>>>>>>> sketches should never change, there could be users who have stored previous
>>>>>>>>>> sketches using ApproximateDistinct and it will be important to try and
>>>>>>>>>> ensure they do not have a bad experience.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Proposals:
>>>>>>>>>>
>>>>>>>>>> There are two classes of users expected for these algorithms:
>>>>>>>>>>
>>>>>>>>>> 1) Users who simply use the transform to estimate the size of
>>>>>>>>>> their data set in Beam
>>>>>>>>>>
>>>>>>>>>> 2) Users who want to create sketches and store them, either for
>>>>>>>>>> interoperability with other systems, or as features to be used in further
>>>>>>>>>> data processing.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> For use case 1, it is possible to make use of naming which does
>>>>>>>>>> not expose the implementation, however for use case 2 it is important for
>>>>>>>>>> the implementation to be explicit as sketches produced with one
>>>>>>>>>> implementation will not work with other implementations.
>>>>>>>>>>
>>>>>>>>>> ApproximateUnique.java
>>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>
>>>>>>>>>> :
>>>>>>>>>>
>>>>>>>>>> This one does not provide sketches and based on the notes above,
>>>>>>>>>> is not as efficient as HLLCount. However it does have a very searchable
>>>>>>>>>> name and is likely to be the one users will gravitate to when searching for
>>>>>>>>>> Approximate unique algorithms but do not need the capabilities of sketches.
>>>>>>>>>>
>>>>>>>>>> Ideally we should think about switching the implementation of
>>>>>>>>>> this transform to wrap HLLCount. However this could mean changing things in
>>>>>>>>>> a way which is not transparent to the end developer.  Although as a result
>>>>>>>>>> of the change they would get a better implementation for free on an upgrade
>>>>>>>>>> :-)
>>>>>>>>>>
>>>>>>>>>> Another option would be to mark this transform as @Deprecated and
>>>>>>>>>> create a new transform ApproximateCountDistinct which would wrap
>>>>>>>>>> HLLCount. The name is also much clearer.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Marking it deprecated instead of changing its implementation will
>>>>>>>>> probably create a less surprising experience for the users.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ApproximateDistinct.java
>>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>>>>>>
>>>>>>>>>> This transform does generate sketches as output and given its
>>>>>>>>>> marked as @Experimental, one option we would have is to create a name which
>>>>>>>>>> includes the algorithm implementation details, for example
>>>>>>>>>> ApproximateCountDistinctClearSpring.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Can we remove this, if it is both experimental and providing the
>>>>>>>>> same utility as HllCount? Is the concern that users might have stored
>>>>>>>>> sketches?
>>>>>>>>>
>>>>>>>>
>>>>>>>> I think this is the sort of thing that has been experimental
>>>>>>>> forever, and therefore not experimental (e.g. the entire triggering API is
>>>>>>>> experimental as are all our file-based sinks). I think that many users use
>>>>>>>> this, and probably store the state implicitly in streaming pipelines.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> HllCount.java
>>>>>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>
>>>>>>>>>> .
>>>>>>>>>>
>>>>>>>>>> Again we have a few options here, as the name does not include
>>>>>>>>>> search words like Approximate, we can create a wrapper which has a name
>>>>>>>>>> similar to ApproximateCountDistinctZetaSketch.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thoughts?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Reza
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>>> This email may be confidential and privileged. If you received
>>>>>>>>>> this communication by mistake, please don't forward it to anyone else,
>>>>>>>>>> please erase all copies and attachments, and please let me know that it has
>>>>>>>>>> gone to the wrong person.
>>>>>>>>>>
>>>>>>>>>> The above terms reflect a potential business arrangement, are
>>>>>>>>>> provided solely as a basis for further discussion, and are not intended to
>>>>>>>>>> be and do not constitute a legally binding obligation. No legally binding
>>>>>>>>>> obligations will be created, implied, or inferred until an agreement in
>>>>>>>>>> final form is executed in writing by all parties involved.
>>>>>>>>>>
>>>>>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> This email may be confidential and privileged. If you received this
>>>>> communication by mistake, please don't forward it to anyone else, please
>>>>> erase all copies and attachments, and please let me know that it has gone
>>>>> to the wrong person.
>>>>>
>>>>> The above terms reflect a potential business arrangement, are provided
>>>>> solely as a basis for further discussion, and are not intended to be and do
>>>>> not constitute a legally binding obligation. No legally binding obligations
>>>>> will be created, implied, or inferred until an agreement in final form is
>>>>> executed in writing by all parties involved.
>>>>>
>>>>
>>
>> --
>>
>> This email may be confidential and privileged. If you received this
>> communication by mistake, please don't forward it to anyone else, please
>> erase all copies and attachments, and please let me know that it has gone
>> to the wrong person.
>>
>> The above terms reflect a potential business arrangement, are provided
>> solely as a basis for further discussion, and are not intended to be and do
>> not constitute a legally binding obligation. No legally binding obligations
>> will be created, implied, or inferred until an agreement in final form is
>> executed in writing by all parties involved.
>>
>

Re: Cleaning up Approximate Algorithms in Beam

Posted by Robert Bradshaw <ro...@google.com>.
I think this thread is sufficient.

On Mon, Nov 25, 2019 at 5:59 PM Reza Rokni <re...@google.com> wrote:

> Hi,
>
> So do we need a vote for the final list of actions? Or is this thread
> enough to go ahead and raise the PR's?
>
> Cheers
>
> Reza
>
> On Tue, 26 Nov 2019 at 06:01, Ahmet Altay <al...@google.com> wrote:
>
>>
>>
>> On Mon, Nov 18, 2019 at 10:57 AM Robert Bradshaw <ro...@google.com>
>> wrote:
>>
>>> On Sun, Nov 17, 2019 at 5:16 PM Reza Rokni <re...@google.com> wrote:
>>>
>>>> *Ahmet: FWIW, There is a python implementation only for this
>>>> version: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38
>>>> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38> *
>>>> Eventually we will be able to make use of cross language transforms to
>>>> help with feature parity. Until then, are we ok with marking this
>>>> deprecated in python, even though we do not have another solution. Or leave
>>>> it as is in Python now, as it does not have sketch capability so can only
>>>> be used for outputting results directly from the pipeline.
>>>>
>>>
>> If it is our intention to add the capability eventually, IMO it makes
>> sense to mark the existing functionality deprecated in Python as well.
>>
>>
>>>> *Reuven: I think this is the sort of thing that has been experimental
>>>> forever, and therefore not experimental (e.g. the entire triggering API is
>>>> experimental as are all our file-based sinks). I think that many users use
>>>> this, and probably store the state implicitly in streaming pipelines.*
>>>> True, I have an old action item to try and go through and PR against
>>>> old @experimental annotations but need to find time. So for this
>>>> discussion; I guess this should be marked as deprecated if we change it
>>>> even though its @experimental.
>>>>
>>>
>>> Agreed.
>>>
>>>
>>>> *Rob: I'm not following this--by naming things after their
>>>> implementation rather than their intent I think they will be harder to
>>>> search for. *
>>>> This is to add to the name the implementation, after the intent. For
>>>> example ApproximateCountDistinctZetaSketch, I believe should be easy to
>>>> search for and it is clear which implementation is used. Allowing for a
>>>> potentially better implementation ApproximateCountDistinct<FooBar>.
>>>>
>>>
>>> OK, if we have both I'm more OK with that. This is better than the names
>>> like HllCount, which seems to be what was suggested.
>>>
>>> Another approach would be to have a required  parameter which is an enum
>>>> of the implementation options.
>>>> ApproximateCountDistinct.of().usingImpl(ZETA) ?
>>>>
>>>
>>> Ideally this could be an optional parameter, or possibly only required
>>> during update until we figure out a good way for the runner to plug this in
>>> appropreately.
>>>
>>> Rob/Kenn: On Combiner discussion, should we tie action items from the
>>>> needs of this thread to this larger discussion?
>>>>
>>>> Cheers
>>>> Reza
>>>>
>>>> On Fri, 15 Nov 2019 at 08:32, Robert Bradshaw <ro...@google.com>
>>>> wrote:
>>>>
>>>>> On Thu, Nov 14, 2019 at 1:06 AM Kenneth Knowles <ke...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Wow. Nice summary, yes. Major calls to action:
>>>>>>
>>>>>> 0. Never allow a combiner that does not include the format of its
>>>>>> state clear in its name/URN. The "update compatibility" problem makes their
>>>>>> internal accumulator state essentially part of their public API. Combiners
>>>>>> named for what they do are an inherent risk, since we might have a new way
>>>>>> to do the same operation with different implementation-detail state.
>>>>>>
>>>>>
>>>>> It seems this will make for a worse user experience, motivated solely
>>>>> by limitations in our implementation. I think we can do better.
>>>>> Hypothetical idea: what if upgrade required access to the original graph
>>>>> (or at least metadata about it) during construction? In this case an
>>>>> ApproximateDistinct could look at what was used last time and try to do the
>>>>> same, but be free to do something better when unconstrained. Another
>>>>> approach would be to encode several alternative expansions in the Beam
>>>>> graph and let the runner do the picking (based on prior submission).
>>>>> (Making the CombineFn, as opposed to the composite, have several
>>>>> alternatives seems harder to reason about, but maybe worth pursuing as
>>>>> well).
>>>>>
>>>>> This is not unique to Combiners, but any stateful DoFn, or composite
>>>>> operations with non-trivial internal structure (and coders). This has been
>>>>> discussed a lot, perhaps there are some ideas there we could borrow?
>>>>>
>>>>> And they will match search terms better, which is a major problem.
>>>>>>
>>>>>
>>>>> I'm not following this--by naming things after their implementation
>>>>> rather than their intent I think they will be harder to search for.
>>>>>
>>>>>
>>>>>> 1. Point users to HllCount. This seems to be the best of the three.
>>>>>> Does it have a name that is clear enough about the format of its state?
>>>>>> Noting that its Java package name includes zetasketch, perhaps.
>>>>>> 2. Deprecate the others, at least. And remove them from e.g. Javadoc.
>>>>>>
>>>>>
>>>>> +1
>>>>>
>>>>>
>>>>>> On Wed, Nov 13, 2019 at 10:01 AM Reuven Lax <re...@google.com> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Nov 13, 2019 at 9:58 AM Ahmet Altay <al...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thank you for writing this summary.
>>>>>>>>
>>>>>>>> On Tue, Nov 12, 2019 at 6:35 PM Reza Rokni <re...@google.com> wrote:
>>>>>>>>
>>>>>>>>> Hi everyone;
>>>>>>>>>
>>>>>>>>> TL/DR : Discussion on Beam's various Approximate Distinct Count
>>>>>>>>> algorithms.
>>>>>>>>>
>>>>>>>>> Today there are several options for Approximate Algorithms in
>>>>>>>>> Apache Beam 2.16 with HLLCount being the most recently added. Would like to
>>>>>>>>> canvas opinions here on the possibility of rationalizing these API's by
>>>>>>>>> removing obsolete / less efficient implementations.
>>>>>>>>> The current situation:
>>>>>>>>>
>>>>>>>>> There are three options available to users: ApproximateUnique.java
>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>,
>>>>>>>>> ApproximateDistinct.java
>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>>>>> and HllCount.java
>>>>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>.
>>>>>>>>> A quick summary of these API's as I understand them:
>>>>>>>>>
>>>>>>>>> HllCount.java
>>>>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>:
>>>>>>>>> Marked as @Experimental
>>>>>>>>>
>>>>>>>>> PTransforms to compute HyperLogLogPlusPlus (HLL++) sketches on
>>>>>>>>> data streams based on the ZetaSketch
>>>>>>>>> <https://github.com/google/zetasketch> implementation.Detailed
>>>>>>>>> design of this class, see https://s.apache.org/hll-in-beam.
>>>>>>>>>
>>>>>>>>> ApproximateUnique.java
>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>:
>>>>>>>>> Not Marked with experimental
>>>>>>>>>
>>>>>>>>> This API does not expose the ability to create sketches so it's
>>>>>>>>> not suitable for the OLAP use case that HLL++ is geared towards (do
>>>>>>>>> pre-aggregation into sketches to allow interactive query speeds). It's also
>>>>>>>>> less precise for the same amount of memory used: the error bounds in the
>>>>>>>>> doc comments give :
>>>>>>>>>
>>>>>>>>> /* The error is about
>>>>>>>>>
>>>>>>>>> {@code 2 * / sqrt(sampleSize)},) */
>>>>>>>>>
>>>>>>>>> Compared to the default HLLCount sketch size, its error is 10X
>>>>>>>>> larger than the HLL++ error.
>>>>>>>>>
>>>>>>>>
>>>>>>>> FWIW, There is a python implementation only for this version:
>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> ApproximateDistinct.java
>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>>>>> Marked with @Experimental
>>>>>>>>>
>>>>>>>>> This is a re-implementation of the HLL++ algorithm, based on the
>>>>>>>>> paper published in 2013. It is exposing sketches via a
>>>>>>>>> HyperLogLogPlusCoder. We have not run any benchmarks to compare this
>>>>>>>>> implementation compared to the HLLCount and we need to be careful to ensure
>>>>>>>>> that if we were to change any of these API's that the binary format of the
>>>>>>>>> sketches should never change, there could be users who have stored previous
>>>>>>>>> sketches using ApproximateDistinct and it will be important to try and
>>>>>>>>> ensure they do not have a bad experience.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Proposals:
>>>>>>>>>
>>>>>>>>> There are two classes of users expected for these algorithms:
>>>>>>>>>
>>>>>>>>> 1) Users who simply use the transform to estimate the size of
>>>>>>>>> their data set in Beam
>>>>>>>>>
>>>>>>>>> 2) Users who want to create sketches and store them, either for
>>>>>>>>> interoperability with other systems, or as features to be used in further
>>>>>>>>> data processing.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> For use case 1, it is possible to make use of naming which does
>>>>>>>>> not expose the implementation, however for use case 2 it is important for
>>>>>>>>> the implementation to be explicit as sketches produced with one
>>>>>>>>> implementation will not work with other implementations.
>>>>>>>>>
>>>>>>>>> ApproximateUnique.java
>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>
>>>>>>>>> :
>>>>>>>>>
>>>>>>>>> This one does not provide sketches and based on the notes above,
>>>>>>>>> is not as efficient as HLLCount. However it does have a very searchable
>>>>>>>>> name and is likely to be the one users will gravitate to when searching for
>>>>>>>>> Approximate unique algorithms but do not need the capabilities of sketches.
>>>>>>>>>
>>>>>>>>> Ideally we should think about switching the implementation of this
>>>>>>>>> transform to wrap HLLCount. However this could mean changing things in a
>>>>>>>>> way which is not transparent to the end developer.  Although as a result of
>>>>>>>>> the change they would get a better implementation for free on an upgrade :-)
>>>>>>>>>
>>>>>>>>> Another option would be to mark this transform as @Deprecated and
>>>>>>>>> create a new transform ApproximateCountDistinct which would wrap
>>>>>>>>> HLLCount. The name is also much clearer.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Marking it deprecated instead of changing its implementation will
>>>>>>>> probably create a less surprising experience for the users.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> ApproximateDistinct.java
>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>>>>>
>>>>>>>>> This transform does generate sketches as output and given its
>>>>>>>>> marked as @Experimental, one option we would have is to create a name which
>>>>>>>>> includes the algorithm implementation details, for example
>>>>>>>>> ApproximateCountDistinctClearSpring.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Can we remove this, if it is both experimental and providing the
>>>>>>>> same utility as HllCount? Is the concern that users might have stored
>>>>>>>> sketches?
>>>>>>>>
>>>>>>>
>>>>>>> I think this is the sort of thing that has been experimental
>>>>>>> forever, and therefore not experimental (e.g. the entire triggering API is
>>>>>>> experimental as are all our file-based sinks). I think that many users use
>>>>>>> this, and probably store the state implicitly in streaming pipelines.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> HllCount.java
>>>>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>
>>>>>>>>> .
>>>>>>>>>
>>>>>>>>> Again we have a few options here, as the name does not include
>>>>>>>>> search words like Approximate, we can create a wrapper which has a name
>>>>>>>>> similar to ApproximateCountDistinctZetaSketch.
>>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thoughts?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Reza
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> This email may be confidential and privileged. If you received
>>>>>>>>> this communication by mistake, please don't forward it to anyone else,
>>>>>>>>> please erase all copies and attachments, and please let me know that it has
>>>>>>>>> gone to the wrong person.
>>>>>>>>>
>>>>>>>>> The above terms reflect a potential business arrangement, are
>>>>>>>>> provided solely as a basis for further discussion, and are not intended to
>>>>>>>>> be and do not constitute a legally binding obligation. No legally binding
>>>>>>>>> obligations will be created, implied, or inferred until an agreement in
>>>>>>>>> final form is executed in writing by all parties involved.
>>>>>>>>>
>>>>>>>>
>>>>
>>>> --
>>>>
>>>> This email may be confidential and privileged. If you received this
>>>> communication by mistake, please don't forward it to anyone else, please
>>>> erase all copies and attachments, and please let me know that it has gone
>>>> to the wrong person.
>>>>
>>>> The above terms reflect a potential business arrangement, are provided
>>>> solely as a basis for further discussion, and are not intended to be and do
>>>> not constitute a legally binding obligation. No legally binding obligations
>>>> will be created, implied, or inferred until an agreement in final form is
>>>> executed in writing by all parties involved.
>>>>
>>>
>
> --
>
> This email may be confidential and privileged. If you received this
> communication by mistake, please don't forward it to anyone else, please
> erase all copies and attachments, and please let me know that it has gone
> to the wrong person.
>
> The above terms reflect a potential business arrangement, are provided
> solely as a basis for further discussion, and are not intended to be and do
> not constitute a legally binding obligation. No legally binding obligations
> will be created, implied, or inferred until an agreement in final form is
> executed in writing by all parties involved.
>

Re: Cleaning up Approximate Algorithms in Beam

Posted by Reza Rokni <re...@google.com>.
Hi,

So do we need a vote for the final list of actions? Or is this thread
enough to go ahead and raise the PR's?

Cheers

Reza

On Tue, 26 Nov 2019 at 06:01, Ahmet Altay <al...@google.com> wrote:

>
>
> On Mon, Nov 18, 2019 at 10:57 AM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> On Sun, Nov 17, 2019 at 5:16 PM Reza Rokni <re...@google.com> wrote:
>>
>>> *Ahmet: FWIW, There is a python implementation only for this
>>> version: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38
>>> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38> *
>>> Eventually we will be able to make use of cross language transforms to
>>> help with feature parity. Until then, are we ok with marking this
>>> deprecated in python, even though we do not have another solution. Or leave
>>> it as is in Python now, as it does not have sketch capability so can only
>>> be used for outputting results directly from the pipeline.
>>>
>>
> If it is our intention to add the capability eventually, IMO it makes
> sense to mark the existing functionality deprecated in Python as well.
>
>
>>> *Reuven: I think this is the sort of thing that has been experimental
>>> forever, and therefore not experimental (e.g. the entire triggering API is
>>> experimental as are all our file-based sinks). I think that many users use
>>> this, and probably store the state implicitly in streaming pipelines.*
>>> True, I have an old action item to try and go through and PR against
>>> old @experimental annotations but need to find time. So for this
>>> discussion; I guess this should be marked as deprecated if we change it
>>> even though its @experimental.
>>>
>>
>> Agreed.
>>
>>
>>> *Rob: I'm not following this--by naming things after their
>>> implementation rather than their intent I think they will be harder to
>>> search for. *
>>> This is to add to the name the implementation, after the intent. For
>>> example ApproximateCountDistinctZetaSketch, I believe should be easy to
>>> search for and it is clear which implementation is used. Allowing for a
>>> potentially better implementation ApproximateCountDistinct<FooBar>.
>>>
>>
>> OK, if we have both I'm more OK with that. This is better than the names
>> like HllCount, which seems to be what was suggested.
>>
>> Another approach would be to have a required  parameter which is an enum
>>> of the implementation options.
>>> ApproximateCountDistinct.of().usingImpl(ZETA) ?
>>>
>>
>> Ideally this could be an optional parameter, or possibly only required
>> during update until we figure out a good way for the runner to plug this in
>> appropreately.
>>
>> Rob/Kenn: On Combiner discussion, should we tie action items from the
>>> needs of this thread to this larger discussion?
>>>
>>> Cheers
>>> Reza
>>>
>>> On Fri, 15 Nov 2019 at 08:32, Robert Bradshaw <ro...@google.com>
>>> wrote:
>>>
>>>> On Thu, Nov 14, 2019 at 1:06 AM Kenneth Knowles <ke...@apache.org>
>>>> wrote:
>>>>
>>>>> Wow. Nice summary, yes. Major calls to action:
>>>>>
>>>>> 0. Never allow a combiner that does not include the format of its
>>>>> state clear in its name/URN. The "update compatibility" problem makes their
>>>>> internal accumulator state essentially part of their public API. Combiners
>>>>> named for what they do are an inherent risk, since we might have a new way
>>>>> to do the same operation with different implementation-detail state.
>>>>>
>>>>
>>>> It seems this will make for a worse user experience, motivated solely
>>>> by limitations in our implementation. I think we can do better.
>>>> Hypothetical idea: what if upgrade required access to the original graph
>>>> (or at least metadata about it) during construction? In this case an
>>>> ApproximateDistinct could look at what was used last time and try to do the
>>>> same, but be free to do something better when unconstrained. Another
>>>> approach would be to encode several alternative expansions in the Beam
>>>> graph and let the runner do the picking (based on prior submission).
>>>> (Making the CombineFn, as opposed to the composite, have several
>>>> alternatives seems harder to reason about, but maybe worth pursuing as
>>>> well).
>>>>
>>>> This is not unique to Combiners, but any stateful DoFn, or composite
>>>> operations with non-trivial internal structure (and coders). This has been
>>>> discussed a lot, perhaps there are some ideas there we could borrow?
>>>>
>>>> And they will match search terms better, which is a major problem.
>>>>>
>>>>
>>>> I'm not following this--by naming things after their implementation
>>>> rather than their intent I think they will be harder to search for.
>>>>
>>>>
>>>>> 1. Point users to HllCount. This seems to be the best of the three.
>>>>> Does it have a name that is clear enough about the format of its state?
>>>>> Noting that its Java package name includes zetasketch, perhaps.
>>>>> 2. Deprecate the others, at least. And remove them from e.g. Javadoc.
>>>>>
>>>>
>>>> +1
>>>>
>>>>
>>>>> On Wed, Nov 13, 2019 at 10:01 AM Reuven Lax <re...@google.com> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Nov 13, 2019 at 9:58 AM Ahmet Altay <al...@google.com> wrote:
>>>>>>
>>>>>>> Thank you for writing this summary.
>>>>>>>
>>>>>>> On Tue, Nov 12, 2019 at 6:35 PM Reza Rokni <re...@google.com> wrote:
>>>>>>>
>>>>>>>> Hi everyone;
>>>>>>>>
>>>>>>>> TL/DR : Discussion on Beam's various Approximate Distinct Count
>>>>>>>> algorithms.
>>>>>>>>
>>>>>>>> Today there are several options for Approximate Algorithms in
>>>>>>>> Apache Beam 2.16 with HLLCount being the most recently added. Would like to
>>>>>>>> canvas opinions here on the possibility of rationalizing these API's by
>>>>>>>> removing obsolete / less efficient implementations.
>>>>>>>> The current situation:
>>>>>>>>
>>>>>>>> There are three options available to users: ApproximateUnique.java
>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>,
>>>>>>>> ApproximateDistinct.java
>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>>>> and HllCount.java
>>>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>.
>>>>>>>> A quick summary of these API's as I understand them:
>>>>>>>>
>>>>>>>> HllCount.java
>>>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>:
>>>>>>>> Marked as @Experimental
>>>>>>>>
>>>>>>>> PTransforms to compute HyperLogLogPlusPlus (HLL++) sketches on data
>>>>>>>> streams based on the ZetaSketch
>>>>>>>> <https://github.com/google/zetasketch> implementation.Detailed
>>>>>>>> design of this class, see https://s.apache.org/hll-in-beam.
>>>>>>>>
>>>>>>>> ApproximateUnique.java
>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>:
>>>>>>>> Not Marked with experimental
>>>>>>>>
>>>>>>>> This API does not expose the ability to create sketches so it's not
>>>>>>>> suitable for the OLAP use case that HLL++ is geared towards (do
>>>>>>>> pre-aggregation into sketches to allow interactive query speeds). It's also
>>>>>>>> less precise for the same amount of memory used: the error bounds in the
>>>>>>>> doc comments give :
>>>>>>>>
>>>>>>>> /* The error is about
>>>>>>>>
>>>>>>>> {@code 2 * / sqrt(sampleSize)},) */
>>>>>>>>
>>>>>>>> Compared to the default HLLCount sketch size, its error is 10X
>>>>>>>> larger than the HLL++ error.
>>>>>>>>
>>>>>>>
>>>>>>> FWIW, There is a python implementation only for this version:
>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> ApproximateDistinct.java
>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>>>> Marked with @Experimental
>>>>>>>>
>>>>>>>> This is a re-implementation of the HLL++ algorithm, based on the
>>>>>>>> paper published in 2013. It is exposing sketches via a
>>>>>>>> HyperLogLogPlusCoder. We have not run any benchmarks to compare this
>>>>>>>> implementation compared to the HLLCount and we need to be careful to ensure
>>>>>>>> that if we were to change any of these API's that the binary format of the
>>>>>>>> sketches should never change, there could be users who have stored previous
>>>>>>>> sketches using ApproximateDistinct and it will be important to try and
>>>>>>>> ensure they do not have a bad experience.
>>>>>>>>
>>>>>>>>
>>>>>>>> Proposals:
>>>>>>>>
>>>>>>>> There are two classes of users expected for these algorithms:
>>>>>>>>
>>>>>>>> 1) Users who simply use the transform to estimate the size of their
>>>>>>>> data set in Beam
>>>>>>>>
>>>>>>>> 2) Users who want to create sketches and store them, either for
>>>>>>>> interoperability with other systems, or as features to be used in further
>>>>>>>> data processing.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> For use case 1, it is possible to make use of naming which does not
>>>>>>>> expose the implementation, however for use case 2 it is important for the
>>>>>>>> implementation to be explicit as sketches produced with one implementation
>>>>>>>> will not work with other implementations.
>>>>>>>>
>>>>>>>> ApproximateUnique.java
>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>
>>>>>>>> :
>>>>>>>>
>>>>>>>> This one does not provide sketches and based on the notes above, is
>>>>>>>> not as efficient as HLLCount. However it does have a very searchable name
>>>>>>>> and is likely to be the one users will gravitate to when searching for
>>>>>>>> Approximate unique algorithms but do not need the capabilities of sketches.
>>>>>>>>
>>>>>>>> Ideally we should think about switching the implementation of this
>>>>>>>> transform to wrap HLLCount. However this could mean changing things in a
>>>>>>>> way which is not transparent to the end developer.  Although as a result of
>>>>>>>> the change they would get a better implementation for free on an upgrade :-)
>>>>>>>>
>>>>>>>> Another option would be to mark this transform as @Deprecated and
>>>>>>>> create a new transform ApproximateCountDistinct which would wrap
>>>>>>>> HLLCount. The name is also much clearer.
>>>>>>>>
>>>>>>>
>>>>>>> Marking it deprecated instead of changing its implementation will
>>>>>>> probably create a less surprising experience for the users.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> ApproximateDistinct.java
>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>>>>
>>>>>>>> This transform does generate sketches as output and given its
>>>>>>>> marked as @Experimental, one option we would have is to create a name which
>>>>>>>> includes the algorithm implementation details, for example
>>>>>>>> ApproximateCountDistinctClearSpring.
>>>>>>>>
>>>>>>>
>>>>>>> Can we remove this, if it is both experimental and providing the
>>>>>>> same utility as HllCount? Is the concern that users might have stored
>>>>>>> sketches?
>>>>>>>
>>>>>>
>>>>>> I think this is the sort of thing that has been experimental forever,
>>>>>> and therefore not experimental (e.g. the entire triggering API is
>>>>>> experimental as are all our file-based sinks). I think that many users use
>>>>>> this, and probably store the state implicitly in streaming pipelines.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> HllCount.java
>>>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>
>>>>>>>> .
>>>>>>>>
>>>>>>>> Again we have a few options here, as the name does not include
>>>>>>>> search words like Approximate, we can create a wrapper which has a name
>>>>>>>> similar to ApproximateCountDistinctZetaSketch.
>>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Thoughts?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Reza
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> This email may be confidential and privileged. If you received this
>>>>>>>> communication by mistake, please don't forward it to anyone else, please
>>>>>>>> erase all copies and attachments, and please let me know that it has gone
>>>>>>>> to the wrong person.
>>>>>>>>
>>>>>>>> The above terms reflect a potential business arrangement, are
>>>>>>>> provided solely as a basis for further discussion, and are not intended to
>>>>>>>> be and do not constitute a legally binding obligation. No legally binding
>>>>>>>> obligations will be created, implied, or inferred until an agreement in
>>>>>>>> final form is executed in writing by all parties involved.
>>>>>>>>
>>>>>>>
>>>
>>> --
>>>
>>> This email may be confidential and privileged. If you received this
>>> communication by mistake, please don't forward it to anyone else, please
>>> erase all copies and attachments, and please let me know that it has gone
>>> to the wrong person.
>>>
>>> The above terms reflect a potential business arrangement, are provided
>>> solely as a basis for further discussion, and are not intended to be and do
>>> not constitute a legally binding obligation. No legally binding obligations
>>> will be created, implied, or inferred until an agreement in final form is
>>> executed in writing by all parties involved.
>>>
>>

-- 

This email may be confidential and privileged. If you received this
communication by mistake, please don't forward it to anyone else, please
erase all copies and attachments, and please let me know that it has gone
to the wrong person.

The above terms reflect a potential business arrangement, are provided
solely as a basis for further discussion, and are not intended to be and do
not constitute a legally binding obligation. No legally binding obligations
will be created, implied, or inferred until an agreement in final form is
executed in writing by all parties involved.

Re: Cleaning up Approximate Algorithms in Beam

Posted by Ahmet Altay <al...@google.com>.
On Mon, Nov 18, 2019 at 10:57 AM Robert Bradshaw <ro...@google.com>
wrote:

> On Sun, Nov 17, 2019 at 5:16 PM Reza Rokni <re...@google.com> wrote:
>
>> *Ahmet: FWIW, There is a python implementation only for this
>> version: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38
>> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38> *
>> Eventually we will be able to make use of cross language transforms to
>> help with feature parity. Until then, are we ok with marking this
>> deprecated in python, even though we do not have another solution. Or leave
>> it as is in Python now, as it does not have sketch capability so can only
>> be used for outputting results directly from the pipeline.
>>
>
If it is our intention to add the capability eventually, IMO it makes sense
to mark the existing functionality deprecated in Python as well.


>> *Reuven: I think this is the sort of thing that has been experimental
>> forever, and therefore not experimental (e.g. the entire triggering API is
>> experimental as are all our file-based sinks). I think that many users use
>> this, and probably store the state implicitly in streaming pipelines.*
>> True, I have an old action item to try and go through and PR against
>> old @experimental annotations but need to find time. So for this
>> discussion; I guess this should be marked as deprecated if we change it
>> even though its @experimental.
>>
>
> Agreed.
>
>
>> *Rob: I'm not following this--by naming things after their implementation
>> rather than their intent I think they will be harder to search for. *
>> This is to add to the name the implementation, after the intent. For
>> example ApproximateCountDistinctZetaSketch, I believe should be easy to
>> search for and it is clear which implementation is used. Allowing for a
>> potentially better implementation ApproximateCountDistinct<FooBar>.
>>
>
> OK, if we have both I'm more OK with that. This is better than the names
> like HllCount, which seems to be what was suggested.
>
> Another approach would be to have a required  parameter which is an enum
>> of the implementation options.
>> ApproximateCountDistinct.of().usingImpl(ZETA) ?
>>
>
> Ideally this could be an optional parameter, or possibly only required
> during update until we figure out a good way for the runner to plug this in
> appropreately.
>
> Rob/Kenn: On Combiner discussion, should we tie action items from the
>> needs of this thread to this larger discussion?
>>
>> Cheers
>> Reza
>>
>> On Fri, 15 Nov 2019 at 08:32, Robert Bradshaw <ro...@google.com>
>> wrote:
>>
>>> On Thu, Nov 14, 2019 at 1:06 AM Kenneth Knowles <ke...@apache.org> wrote:
>>>
>>>> Wow. Nice summary, yes. Major calls to action:
>>>>
>>>> 0. Never allow a combiner that does not include the format of its state
>>>> clear in its name/URN. The "update compatibility" problem makes their
>>>> internal accumulator state essentially part of their public API. Combiners
>>>> named for what they do are an inherent risk, since we might have a new way
>>>> to do the same operation with different implementation-detail state.
>>>>
>>>
>>> It seems this will make for a worse user experience, motivated solely by
>>> limitations in our implementation. I think we can do better. Hypothetical
>>> idea: what if upgrade required access to the original graph (or at least
>>> metadata about it) during construction? In this case an ApproximateDistinct
>>> could look at what was used last time and try to do the same, but be free
>>> to do something better when unconstrained. Another approach would be to
>>> encode several alternative expansions in the Beam graph and let the runner
>>> do the picking (based on prior submission). (Making the CombineFn, as
>>> opposed to the composite, have several alternatives seems harder to reason
>>> about, but maybe worth pursuing as well).
>>>
>>> This is not unique to Combiners, but any stateful DoFn, or composite
>>> operations with non-trivial internal structure (and coders). This has been
>>> discussed a lot, perhaps there are some ideas there we could borrow?
>>>
>>> And they will match search terms better, which is a major problem.
>>>>
>>>
>>> I'm not following this--by naming things after their implementation
>>> rather than their intent I think they will be harder to search for.
>>>
>>>
>>>> 1. Point users to HllCount. This seems to be the best of the three.
>>>> Does it have a name that is clear enough about the format of its state?
>>>> Noting that its Java package name includes zetasketch, perhaps.
>>>> 2. Deprecate the others, at least. And remove them from e.g. Javadoc.
>>>>
>>>
>>> +1
>>>
>>>
>>>> On Wed, Nov 13, 2019 at 10:01 AM Reuven Lax <re...@google.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Wed, Nov 13, 2019 at 9:58 AM Ahmet Altay <al...@google.com> wrote:
>>>>>
>>>>>> Thank you for writing this summary.
>>>>>>
>>>>>> On Tue, Nov 12, 2019 at 6:35 PM Reza Rokni <re...@google.com> wrote:
>>>>>>
>>>>>>> Hi everyone;
>>>>>>>
>>>>>>> TL/DR : Discussion on Beam's various Approximate Distinct Count
>>>>>>> algorithms.
>>>>>>>
>>>>>>> Today there are several options for Approximate Algorithms in Apache
>>>>>>> Beam 2.16 with HLLCount being the most recently added. Would like to canvas
>>>>>>> opinions here on the possibility of rationalizing these API's by removing
>>>>>>> obsolete / less efficient implementations.
>>>>>>> The current situation:
>>>>>>>
>>>>>>> There are three options available to users: ApproximateUnique.java
>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>,
>>>>>>> ApproximateDistinct.java
>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>>> and HllCount.java
>>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>.
>>>>>>> A quick summary of these API's as I understand them:
>>>>>>>
>>>>>>> HllCount.java
>>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>:
>>>>>>> Marked as @Experimental
>>>>>>>
>>>>>>> PTransforms to compute HyperLogLogPlusPlus (HLL++) sketches on data
>>>>>>> streams based on the ZetaSketch
>>>>>>> <https://github.com/google/zetasketch> implementation.Detailed
>>>>>>> design of this class, see https://s.apache.org/hll-in-beam.
>>>>>>>
>>>>>>> ApproximateUnique.java
>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>:
>>>>>>> Not Marked with experimental
>>>>>>>
>>>>>>> This API does not expose the ability to create sketches so it's not
>>>>>>> suitable for the OLAP use case that HLL++ is geared towards (do
>>>>>>> pre-aggregation into sketches to allow interactive query speeds). It's also
>>>>>>> less precise for the same amount of memory used: the error bounds in the
>>>>>>> doc comments give :
>>>>>>>
>>>>>>> /* The error is about
>>>>>>>
>>>>>>> {@code 2 * / sqrt(sampleSize)},) */
>>>>>>>
>>>>>>> Compared to the default HLLCount sketch size, its error is 10X
>>>>>>> larger than the HLL++ error.
>>>>>>>
>>>>>>
>>>>>> FWIW, There is a python implementation only for this version:
>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38
>>>>>>
>>>>>>
>>>>>>
>>>>>>> ApproximateDistinct.java
>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>>> Marked with @Experimental
>>>>>>>
>>>>>>> This is a re-implementation of the HLL++ algorithm, based on the
>>>>>>> paper published in 2013. It is exposing sketches via a
>>>>>>> HyperLogLogPlusCoder. We have not run any benchmarks to compare this
>>>>>>> implementation compared to the HLLCount and we need to be careful to ensure
>>>>>>> that if we were to change any of these API's that the binary format of the
>>>>>>> sketches should never change, there could be users who have stored previous
>>>>>>> sketches using ApproximateDistinct and it will be important to try and
>>>>>>> ensure they do not have a bad experience.
>>>>>>>
>>>>>>>
>>>>>>> Proposals:
>>>>>>>
>>>>>>> There are two classes of users expected for these algorithms:
>>>>>>>
>>>>>>> 1) Users who simply use the transform to estimate the size of their
>>>>>>> data set in Beam
>>>>>>>
>>>>>>> 2) Users who want to create sketches and store them, either for
>>>>>>> interoperability with other systems, or as features to be used in further
>>>>>>> data processing.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> For use case 1, it is possible to make use of naming which does not
>>>>>>> expose the implementation, however for use case 2 it is important for the
>>>>>>> implementation to be explicit as sketches produced with one implementation
>>>>>>> will not work with other implementations.
>>>>>>>
>>>>>>> ApproximateUnique.java
>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>
>>>>>>> :
>>>>>>>
>>>>>>> This one does not provide sketches and based on the notes above, is
>>>>>>> not as efficient as HLLCount. However it does have a very searchable name
>>>>>>> and is likely to be the one users will gravitate to when searching for
>>>>>>> Approximate unique algorithms but do not need the capabilities of sketches.
>>>>>>>
>>>>>>> Ideally we should think about switching the implementation of this
>>>>>>> transform to wrap HLLCount. However this could mean changing things in a
>>>>>>> way which is not transparent to the end developer.  Although as a result of
>>>>>>> the change they would get a better implementation for free on an upgrade :-)
>>>>>>>
>>>>>>> Another option would be to mark this transform as @Deprecated and
>>>>>>> create a new transform ApproximateCountDistinct which would wrap
>>>>>>> HLLCount. The name is also much clearer.
>>>>>>>
>>>>>>
>>>>>> Marking it deprecated instead of changing its implementation will
>>>>>> probably create a less surprising experience for the users.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> ApproximateDistinct.java
>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>>>
>>>>>>> This transform does generate sketches as output and given its marked
>>>>>>> as @Experimental, one option we would have is to create a name which
>>>>>>> includes the algorithm implementation details, for example
>>>>>>> ApproximateCountDistinctClearSpring.
>>>>>>>
>>>>>>
>>>>>> Can we remove this, if it is both experimental and providing the same
>>>>>> utility as HllCount? Is the concern that users might have stored sketches?
>>>>>>
>>>>>
>>>>> I think this is the sort of thing that has been experimental forever,
>>>>> and therefore not experimental (e.g. the entire triggering API is
>>>>> experimental as are all our file-based sinks). I think that many users use
>>>>> this, and probably store the state implicitly in streaming pipelines.
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> HllCount.java
>>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>
>>>>>>> .
>>>>>>>
>>>>>>> Again we have a few options here, as the name does not include
>>>>>>> search words like Approximate, we can create a wrapper which has a name
>>>>>>> similar to ApproximateCountDistinctZetaSketch.
>>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Thoughts?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Reza
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> This email may be confidential and privileged. If you received this
>>>>>>> communication by mistake, please don't forward it to anyone else, please
>>>>>>> erase all copies and attachments, and please let me know that it has gone
>>>>>>> to the wrong person.
>>>>>>>
>>>>>>> The above terms reflect a potential business arrangement, are
>>>>>>> provided solely as a basis for further discussion, and are not intended to
>>>>>>> be and do not constitute a legally binding obligation. No legally binding
>>>>>>> obligations will be created, implied, or inferred until an agreement in
>>>>>>> final form is executed in writing by all parties involved.
>>>>>>>
>>>>>>
>>
>> --
>>
>> This email may be confidential and privileged. If you received this
>> communication by mistake, please don't forward it to anyone else, please
>> erase all copies and attachments, and please let me know that it has gone
>> to the wrong person.
>>
>> The above terms reflect a potential business arrangement, are provided
>> solely as a basis for further discussion, and are not intended to be and do
>> not constitute a legally binding obligation. No legally binding obligations
>> will be created, implied, or inferred until an agreement in final form is
>> executed in writing by all parties involved.
>>
>

Re: Cleaning up Approximate Algorithms in Beam

Posted by Robert Bradshaw <ro...@google.com>.
On Sun, Nov 17, 2019 at 5:16 PM Reza Rokni <re...@google.com> wrote:

> *Ahmet: FWIW, There is a python implementation only for this
> version: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38
> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38> *
> Eventually we will be able to make use of cross language transforms to
> help with feature parity. Until then, are we ok with marking this
> deprecated in python, even though we do not have another solution. Or leave
> it as is in Python now, as it does not have sketch capability so can only
> be used for outputting results directly from the pipeline.
>
> *Reuven: I think this is the sort of thing that has been experimental
> forever, and therefore not experimental (e.g. the entire triggering API is
> experimental as are all our file-based sinks). I think that many users use
> this, and probably store the state implicitly in streaming pipelines.*
> True, I have an old action item to try and go through and PR against
> old @experimental annotations but need to find time. So for this
> discussion; I guess this should be marked as deprecated if we change it
> even though its @experimental.
>

Agreed.


> *Rob: I'm not following this--by naming things after their implementation
> rather than their intent I think they will be harder to search for. *
> This is to add to the name the implementation, after the intent. For
> example ApproximateCountDistinctZetaSketch, I believe should be easy to
> search for and it is clear which implementation is used. Allowing for a
> potentially better implementation ApproximateCountDistinct<FooBar>.
>

OK, if we have both I'm more OK with that. This is better than the names
like HllCount, which seems to be what was suggested.

Another approach would be to have a required  parameter which is an enum of
> the implementation options. ApproximateCountDistinct.of().usingImpl(ZETA) ?
>

Ideally this could be an optional parameter, or possibly only required
during update until we figure out a good way for the runner to plug this in
appropreately.

Rob/Kenn: On Combiner discussion, should we tie action items from the needs
> of this thread to this larger discussion?
>
> Cheers
> Reza
>
> On Fri, 15 Nov 2019 at 08:32, Robert Bradshaw <ro...@google.com> wrote:
>
>> On Thu, Nov 14, 2019 at 1:06 AM Kenneth Knowles <ke...@apache.org> wrote:
>>
>>> Wow. Nice summary, yes. Major calls to action:
>>>
>>> 0. Never allow a combiner that does not include the format of its state
>>> clear in its name/URN. The "update compatibility" problem makes their
>>> internal accumulator state essentially part of their public API. Combiners
>>> named for what they do are an inherent risk, since we might have a new way
>>> to do the same operation with different implementation-detail state.
>>>
>>
>> It seems this will make for a worse user experience, motivated solely by
>> limitations in our implementation. I think we can do better. Hypothetical
>> idea: what if upgrade required access to the original graph (or at least
>> metadata about it) during construction? In this case an ApproximateDistinct
>> could look at what was used last time and try to do the same, but be free
>> to do something better when unconstrained. Another approach would be to
>> encode several alternative expansions in the Beam graph and let the runner
>> do the picking (based on prior submission). (Making the CombineFn, as
>> opposed to the composite, have several alternatives seems harder to reason
>> about, but maybe worth pursuing as well).
>>
>> This is not unique to Combiners, but any stateful DoFn, or composite
>> operations with non-trivial internal structure (and coders). This has been
>> discussed a lot, perhaps there are some ideas there we could borrow?
>>
>> And they will match search terms better, which is a major problem.
>>>
>>
>> I'm not following this--by naming things after their implementation
>> rather than their intent I think they will be harder to search for.
>>
>>
>>> 1. Point users to HllCount. This seems to be the best of the three. Does
>>> it have a name that is clear enough about the format of its state? Noting
>>> that its Java package name includes zetasketch, perhaps.
>>> 2. Deprecate the others, at least. And remove them from e.g. Javadoc.
>>>
>>
>> +1
>>
>>
>>> On Wed, Nov 13, 2019 at 10:01 AM Reuven Lax <re...@google.com> wrote:
>>>
>>>>
>>>>
>>>> On Wed, Nov 13, 2019 at 9:58 AM Ahmet Altay <al...@google.com> wrote:
>>>>
>>>>> Thank you for writing this summary.
>>>>>
>>>>> On Tue, Nov 12, 2019 at 6:35 PM Reza Rokni <re...@google.com> wrote:
>>>>>
>>>>>> Hi everyone;
>>>>>>
>>>>>> TL/DR : Discussion on Beam's various Approximate Distinct Count
>>>>>> algorithms.
>>>>>>
>>>>>> Today there are several options for Approximate Algorithms in Apache
>>>>>> Beam 2.16 with HLLCount being the most recently added. Would like to canvas
>>>>>> opinions here on the possibility of rationalizing these API's by removing
>>>>>> obsolete / less efficient implementations.
>>>>>> The current situation:
>>>>>>
>>>>>> There are three options available to users: ApproximateUnique.java
>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>,
>>>>>> ApproximateDistinct.java
>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>> and HllCount.java
>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>.
>>>>>> A quick summary of these API's as I understand them:
>>>>>>
>>>>>> HllCount.java
>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>:
>>>>>> Marked as @Experimental
>>>>>>
>>>>>> PTransforms to compute HyperLogLogPlusPlus (HLL++) sketches on data
>>>>>> streams based on the ZetaSketch
>>>>>> <https://github.com/google/zetasketch> implementation.Detailed
>>>>>> design of this class, see https://s.apache.org/hll-in-beam.
>>>>>>
>>>>>> ApproximateUnique.java
>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>:
>>>>>> Not Marked with experimental
>>>>>>
>>>>>> This API does not expose the ability to create sketches so it's not
>>>>>> suitable for the OLAP use case that HLL++ is geared towards (do
>>>>>> pre-aggregation into sketches to allow interactive query speeds). It's also
>>>>>> less precise for the same amount of memory used: the error bounds in the
>>>>>> doc comments give :
>>>>>>
>>>>>> /* The error is about
>>>>>>
>>>>>> {@code 2 * / sqrt(sampleSize)},) */
>>>>>>
>>>>>> Compared to the default HLLCount sketch size, its error is 10X larger
>>>>>> than the HLL++ error.
>>>>>>
>>>>>
>>>>> FWIW, There is a python implementation only for this version:
>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38
>>>>>
>>>>>
>>>>>
>>>>>> ApproximateDistinct.java
>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>> Marked with @Experimental
>>>>>>
>>>>>> This is a re-implementation of the HLL++ algorithm, based on the
>>>>>> paper published in 2013. It is exposing sketches via a
>>>>>> HyperLogLogPlusCoder. We have not run any benchmarks to compare this
>>>>>> implementation compared to the HLLCount and we need to be careful to ensure
>>>>>> that if we were to change any of these API's that the binary format of the
>>>>>> sketches should never change, there could be users who have stored previous
>>>>>> sketches using ApproximateDistinct and it will be important to try and
>>>>>> ensure they do not have a bad experience.
>>>>>>
>>>>>>
>>>>>> Proposals:
>>>>>>
>>>>>> There are two classes of users expected for these algorithms:
>>>>>>
>>>>>> 1) Users who simply use the transform to estimate the size of their
>>>>>> data set in Beam
>>>>>>
>>>>>> 2) Users who want to create sketches and store them, either for
>>>>>> interoperability with other systems, or as features to be used in further
>>>>>> data processing.
>>>>>>
>>>>>>
>>>>>>
>>>>>> For use case 1, it is possible to make use of naming which does not
>>>>>> expose the implementation, however for use case 2 it is important for the
>>>>>> implementation to be explicit as sketches produced with one implementation
>>>>>> will not work with other implementations.
>>>>>>
>>>>>> ApproximateUnique.java
>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>
>>>>>> :
>>>>>>
>>>>>> This one does not provide sketches and based on the notes above, is
>>>>>> not as efficient as HLLCount. However it does have a very searchable name
>>>>>> and is likely to be the one users will gravitate to when searching for
>>>>>> Approximate unique algorithms but do not need the capabilities of sketches.
>>>>>>
>>>>>> Ideally we should think about switching the implementation of this
>>>>>> transform to wrap HLLCount. However this could mean changing things in a
>>>>>> way which is not transparent to the end developer.  Although as a result of
>>>>>> the change they would get a better implementation for free on an upgrade :-)
>>>>>>
>>>>>> Another option would be to mark this transform as @Deprecated and
>>>>>> create a new transform ApproximateCountDistinct which would wrap
>>>>>> HLLCount. The name is also much clearer.
>>>>>>
>>>>>
>>>>> Marking it deprecated instead of changing its implementation will
>>>>> probably create a less surprising experience for the users.
>>>>>
>>>>>
>>>>>>
>>>>>> ApproximateDistinct.java
>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>>
>>>>>> This transform does generate sketches as output and given its marked
>>>>>> as @Experimental, one option we would have is to create a name which
>>>>>> includes the algorithm implementation details, for example
>>>>>> ApproximateCountDistinctClearSpring.
>>>>>>
>>>>>
>>>>> Can we remove this, if it is both experimental and providing the same
>>>>> utility as HllCount? Is the concern that users might have stored sketches?
>>>>>
>>>>
>>>> I think this is the sort of thing that has been experimental forever,
>>>> and therefore not experimental (e.g. the entire triggering API is
>>>> experimental as are all our file-based sinks). I think that many users use
>>>> this, and probably store the state implicitly in streaming pipelines.
>>>>
>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> HllCount.java
>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>
>>>>>> .
>>>>>>
>>>>>> Again we have a few options here, as the name does not include search
>>>>>> words like Approximate, we can create a wrapper which has a name similar to
>>>>>> ApproximateCountDistinctZetaSketch.
>>>>>>
>>>>>
>>>>>>
>>>>>> Thoughts?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>>
>>>>>>
>>>>>> Reza
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> This email may be confidential and privileged. If you received this
>>>>>> communication by mistake, please don't forward it to anyone else, please
>>>>>> erase all copies and attachments, and please let me know that it has gone
>>>>>> to the wrong person.
>>>>>>
>>>>>> The above terms reflect a potential business arrangement, are
>>>>>> provided solely as a basis for further discussion, and are not intended to
>>>>>> be and do not constitute a legally binding obligation. No legally binding
>>>>>> obligations will be created, implied, or inferred until an agreement in
>>>>>> final form is executed in writing by all parties involved.
>>>>>>
>>>>>
>
> --
>
> This email may be confidential and privileged. If you received this
> communication by mistake, please don't forward it to anyone else, please
> erase all copies and attachments, and please let me know that it has gone
> to the wrong person.
>
> The above terms reflect a potential business arrangement, are provided
> solely as a basis for further discussion, and are not intended to be and do
> not constitute a legally binding obligation. No legally binding obligations
> will be created, implied, or inferred until an agreement in final form is
> executed in writing by all parties involved.
>

Re: Cleaning up Approximate Algorithms in Beam

Posted by Reza Rokni <re...@google.com>.
*Ahmet: FWIW, There is a python implementation only for this
version: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38
<https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38>
*
Eventually we will be able to make use of cross language transforms to help
with feature parity. Until then, are we ok with marking this deprecated in
python, even though we do not have another solution. Or leave it as is in
Python now, as it does not have sketch capability so can only be used for
outputting results directly from the pipeline.

*Reuven: I think this is the sort of thing that has been experimental
forever, and therefore not experimental (e.g. the entire triggering API is
experimental as are all our file-based sinks). I think that many users use
this, and probably store the state implicitly in streaming pipelines.*
True, I have an old action item to try and go through and PR against
old @experimental annotations but need to find time. So for this
discussion; I guess this should be marked as deprecated if we change it
even though its @experimental.

*Rob: I'm not following this--by naming things after their implementation
rather than their intent I think they will be harder to search for. *
This is to add to the name the implementation, after the intent. For
example ApproximateCountDistinctZetaSketch, I believe should be easy to
search for and it is clear which implementation is used. Allowing for a
potentially better implementation ApproximateCountDistinct<FooBar>. Another
approach would be to have a required  parameter which is an enum of the
implementation options. ApproximateCountDistinct.of().usingImpl(ZETA) ?

Rob/Kenn: On Combiner discussion, should we tie action items from the needs
of this thread to this larger discussion?

Cheers
Reza

On Fri, 15 Nov 2019 at 08:32, Robert Bradshaw <ro...@google.com> wrote:

> On Thu, Nov 14, 2019 at 1:06 AM Kenneth Knowles <ke...@apache.org> wrote:
>
>> Wow. Nice summary, yes. Major calls to action:
>>
>> 0. Never allow a combiner that does not include the format of its state
>> clear in its name/URN. The "update compatibility" problem makes their
>> internal accumulator state essentially part of their public API. Combiners
>> named for what they do are an inherent risk, since we might have a new way
>> to do the same operation with different implementation-detail state.
>>
>
> It seems this will make for a worse user experience, motivated solely by
> limitations in our implementation. I think we can do better. Hypothetical
> idea: what if upgrade required access to the original graph (or at least
> metadata about it) during construction? In this case an ApproximateDistinct
> could look at what was used last time and try to do the same, but be free
> to do something better when unconstrained. Another approach would be to
> encode several alternative expansions in the Beam graph and let the runner
> do the picking (based on prior submission). (Making the CombineFn, as
> opposed to the composite, have several alternatives seems harder to reason
> about, but maybe worth pursuing as well).
>
> This is not unique to Combiners, but any stateful DoFn, or composite
> operations with non-trivial internal structure (and coders). This has been
> discussed a lot, perhaps there are some ideas there we could borrow?
>
> And they will match search terms better, which is a major problem.
>>
>
> I'm not following this--by naming things after their implementation rather
> than their intent I think they will be harder to search for.
>
>
>> 1. Point users to HllCount. This seems to be the best of the three. Does
>> it have a name that is clear enough about the format of its state? Noting
>> that its Java package name includes zetasketch, perhaps.
>> 2. Deprecate the others, at least. And remove them from e.g. Javadoc.
>>
>
> +1
>
>
>> On Wed, Nov 13, 2019 at 10:01 AM Reuven Lax <re...@google.com> wrote:
>>
>>>
>>>
>>> On Wed, Nov 13, 2019 at 9:58 AM Ahmet Altay <al...@google.com> wrote:
>>>
>>>> Thank you for writing this summary.
>>>>
>>>> On Tue, Nov 12, 2019 at 6:35 PM Reza Rokni <re...@google.com> wrote:
>>>>
>>>>> Hi everyone;
>>>>>
>>>>> TL/DR : Discussion on Beam's various Approximate Distinct Count
>>>>> algorithms.
>>>>>
>>>>> Today there are several options for Approximate Algorithms in Apache
>>>>> Beam 2.16 with HLLCount being the most recently added. Would like to canvas
>>>>> opinions here on the possibility of rationalizing these API's by removing
>>>>> obsolete / less efficient implementations.
>>>>> The current situation:
>>>>>
>>>>> There are three options available to users: ApproximateUnique.java
>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>,
>>>>> ApproximateDistinct.java
>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>> and HllCount.java
>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>.
>>>>> A quick summary of these API's as I understand them:
>>>>>
>>>>> HllCount.java
>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>:
>>>>> Marked as @Experimental
>>>>>
>>>>> PTransforms to compute HyperLogLogPlusPlus (HLL++) sketches on data
>>>>> streams based on the ZetaSketch <https://github.com/google/zetasketch>
>>>>> implementation.Detailed design of this class, see
>>>>> https://s.apache.org/hll-in-beam.
>>>>>
>>>>> ApproximateUnique.java
>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>:
>>>>> Not Marked with experimental
>>>>>
>>>>> This API does not expose the ability to create sketches so it's not
>>>>> suitable for the OLAP use case that HLL++ is geared towards (do
>>>>> pre-aggregation into sketches to allow interactive query speeds). It's also
>>>>> less precise for the same amount of memory used: the error bounds in the
>>>>> doc comments give :
>>>>>
>>>>> /* The error is about
>>>>>
>>>>> {@code 2 * / sqrt(sampleSize)},) */
>>>>>
>>>>> Compared to the default HLLCount sketch size, its error is 10X larger
>>>>> than the HLL++ error.
>>>>>
>>>>
>>>> FWIW, There is a python implementation only for this version:
>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38
>>>>
>>>>
>>>>
>>>>> ApproximateDistinct.java
>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>> Marked with @Experimental
>>>>>
>>>>> This is a re-implementation of the HLL++ algorithm, based on the paper
>>>>> published in 2013. It is exposing sketches via a HyperLogLogPlusCoder. We
>>>>> have not run any benchmarks to compare this implementation compared to the
>>>>> HLLCount and we need to be careful to ensure that if we were to change any
>>>>> of these API's that the binary format of the sketches should never change,
>>>>> there could be users who have stored previous sketches using
>>>>> ApproximateDistinct and it will be important to try and ensure they do not
>>>>> have a bad experience.
>>>>>
>>>>>
>>>>> Proposals:
>>>>>
>>>>> There are two classes of users expected for these algorithms:
>>>>>
>>>>> 1) Users who simply use the transform to estimate the size of their
>>>>> data set in Beam
>>>>>
>>>>> 2) Users who want to create sketches and store them, either for
>>>>> interoperability with other systems, or as features to be used in further
>>>>> data processing.
>>>>>
>>>>>
>>>>>
>>>>> For use case 1, it is possible to make use of naming which does not
>>>>> expose the implementation, however for use case 2 it is important for the
>>>>> implementation to be explicit as sketches produced with one implementation
>>>>> will not work with other implementations.
>>>>>
>>>>> ApproximateUnique.java
>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>
>>>>> :
>>>>>
>>>>> This one does not provide sketches and based on the notes above, is
>>>>> not as efficient as HLLCount. However it does have a very searchable name
>>>>> and is likely to be the one users will gravitate to when searching for
>>>>> Approximate unique algorithms but do not need the capabilities of sketches.
>>>>>
>>>>> Ideally we should think about switching the implementation of this
>>>>> transform to wrap HLLCount. However this could mean changing things in a
>>>>> way which is not transparent to the end developer.  Although as a result of
>>>>> the change they would get a better implementation for free on an upgrade :-)
>>>>>
>>>>> Another option would be to mark this transform as @Deprecated and
>>>>> create a new transform ApproximateCountDistinct which would wrap
>>>>> HLLCount. The name is also much clearer.
>>>>>
>>>>
>>>> Marking it deprecated instead of changing its implementation will
>>>> probably create a less surprising experience for the users.
>>>>
>>>>
>>>>>
>>>>> ApproximateDistinct.java
>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>
>>>>> This transform does generate sketches as output and given its marked
>>>>> as @Experimental, one option we would have is to create a name which
>>>>> includes the algorithm implementation details, for example
>>>>> ApproximateCountDistinctClearSpring.
>>>>>
>>>>
>>>> Can we remove this, if it is both experimental and providing the same
>>>> utility as HllCount? Is the concern that users might have stored sketches?
>>>>
>>>
>>> I think this is the sort of thing that has been experimental forever,
>>> and therefore not experimental (e.g. the entire triggering API is
>>> experimental as are all our file-based sinks). I think that many users use
>>> this, and probably store the state implicitly in streaming pipelines.
>>>
>>>
>>>>
>>>>>
>>>>>
>>>>> HllCount.java
>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>
>>>>> .
>>>>>
>>>>> Again we have a few options here, as the name does not include search
>>>>> words like Approximate, we can create a wrapper which has a name similar to
>>>>> ApproximateCountDistinctZetaSketch.
>>>>>
>>>>
>>>>>
>>>>> Thoughts?
>>>>>
>>>>>
>>>>>
>>>>> Cheers
>>>>>
>>>>>
>>>>>
>>>>> Reza
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> This email may be confidential and privileged. If you received this
>>>>> communication by mistake, please don't forward it to anyone else, please
>>>>> erase all copies and attachments, and please let me know that it has gone
>>>>> to the wrong person.
>>>>>
>>>>> The above terms reflect a potential business arrangement, are provided
>>>>> solely as a basis for further discussion, and are not intended to be and do
>>>>> not constitute a legally binding obligation. No legally binding obligations
>>>>> will be created, implied, or inferred until an agreement in final form is
>>>>> executed in writing by all parties involved.
>>>>>
>>>>

-- 

This email may be confidential and privileged. If you received this
communication by mistake, please don't forward it to anyone else, please
erase all copies and attachments, and please let me know that it has gone
to the wrong person.

The above terms reflect a potential business arrangement, are provided
solely as a basis for further discussion, and are not intended to be and do
not constitute a legally binding obligation. No legally binding obligations
will be created, implied, or inferred until an agreement in final form is
executed in writing by all parties involved.

Re: Cleaning up Approximate Algorithms in Beam

Posted by Robert Bradshaw <ro...@google.com>.
On Thu, Nov 14, 2019 at 1:06 AM Kenneth Knowles <ke...@apache.org> wrote:

> Wow. Nice summary, yes. Major calls to action:
>
> 0. Never allow a combiner that does not include the format of its state
> clear in its name/URN. The "update compatibility" problem makes their
> internal accumulator state essentially part of their public API. Combiners
> named for what they do are an inherent risk, since we might have a new way
> to do the same operation with different implementation-detail state.
>

It seems this will make for a worse user experience, motivated solely by
limitations in our implementation. I think we can do better. Hypothetical
idea: what if upgrade required access to the original graph (or at least
metadata about it) during construction? In this case an ApproximateDistinct
could look at what was used last time and try to do the same, but be free
to do something better when unconstrained. Another approach would be to
encode several alternative expansions in the Beam graph and let the runner
do the picking (based on prior submission). (Making the CombineFn, as
opposed to the composite, have several alternatives seems harder to reason
about, but maybe worth pursuing as well).

This is not unique to Combiners, but any stateful DoFn, or composite
operations with non-trivial internal structure (and coders). This has been
discussed a lot, perhaps there are some ideas there we could borrow?

And they will match search terms better, which is a major problem.
>

I'm not following this--by naming things after their implementation rather
than their intent I think they will be harder to search for.


> 1. Point users to HllCount. This seems to be the best of the three. Does
> it have a name that is clear enough about the format of its state? Noting
> that its Java package name includes zetasketch, perhaps.
> 2. Deprecate the others, at least. And remove them from e.g. Javadoc.
>

+1


> On Wed, Nov 13, 2019 at 10:01 AM Reuven Lax <re...@google.com> wrote:
>
>>
>>
>> On Wed, Nov 13, 2019 at 9:58 AM Ahmet Altay <al...@google.com> wrote:
>>
>>> Thank you for writing this summary.
>>>
>>> On Tue, Nov 12, 2019 at 6:35 PM Reza Rokni <re...@google.com> wrote:
>>>
>>>> Hi everyone;
>>>>
>>>> TL/DR : Discussion on Beam's various Approximate Distinct Count
>>>> algorithms.
>>>>
>>>> Today there are several options for Approximate Algorithms in Apache
>>>> Beam 2.16 with HLLCount being the most recently added. Would like to canvas
>>>> opinions here on the possibility of rationalizing these API's by removing
>>>> obsolete / less efficient implementations.
>>>> The current situation:
>>>>
>>>> There are three options available to users: ApproximateUnique.java
>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>,
>>>> ApproximateDistinct.java
>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>> and HllCount.java
>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>.
>>>> A quick summary of these API's as I understand them:
>>>>
>>>> HllCount.java
>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>:
>>>> Marked as @Experimental
>>>>
>>>> PTransforms to compute HyperLogLogPlusPlus (HLL++) sketches on data
>>>> streams based on the ZetaSketch <https://github.com/google/zetasketch>
>>>> implementation.Detailed design of this class, see
>>>> https://s.apache.org/hll-in-beam.
>>>>
>>>> ApproximateUnique.java
>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>:
>>>> Not Marked with experimental
>>>>
>>>> This API does not expose the ability to create sketches so it's not
>>>> suitable for the OLAP use case that HLL++ is geared towards (do
>>>> pre-aggregation into sketches to allow interactive query speeds). It's also
>>>> less precise for the same amount of memory used: the error bounds in the
>>>> doc comments give :
>>>>
>>>> /* The error is about
>>>>
>>>> {@code 2 * / sqrt(sampleSize)},) */
>>>>
>>>> Compared to the default HLLCount sketch size, its error is 10X larger
>>>> than the HLL++ error.
>>>>
>>>
>>> FWIW, There is a python implementation only for this version:
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38
>>>
>>>
>>>
>>>> ApproximateDistinct.java
>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>> Marked with @Experimental
>>>>
>>>> This is a re-implementation of the HLL++ algorithm, based on the paper
>>>> published in 2013. It is exposing sketches via a HyperLogLogPlusCoder. We
>>>> have not run any benchmarks to compare this implementation compared to the
>>>> HLLCount and we need to be careful to ensure that if we were to change any
>>>> of these API's that the binary format of the sketches should never change,
>>>> there could be users who have stored previous sketches using
>>>> ApproximateDistinct and it will be important to try and ensure they do not
>>>> have a bad experience.
>>>>
>>>>
>>>> Proposals:
>>>>
>>>> There are two classes of users expected for these algorithms:
>>>>
>>>> 1) Users who simply use the transform to estimate the size of their
>>>> data set in Beam
>>>>
>>>> 2) Users who want to create sketches and store them, either for
>>>> interoperability with other systems, or as features to be used in further
>>>> data processing.
>>>>
>>>>
>>>>
>>>> For use case 1, it is possible to make use of naming which does not
>>>> expose the implementation, however for use case 2 it is important for the
>>>> implementation to be explicit as sketches produced with one implementation
>>>> will not work with other implementations.
>>>>
>>>> ApproximateUnique.java
>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>
>>>> :
>>>>
>>>> This one does not provide sketches and based on the notes above, is not
>>>> as efficient as HLLCount. However it does have a very searchable name and
>>>> is likely to be the one users will gravitate to when searching for
>>>> Approximate unique algorithms but do not need the capabilities of sketches.
>>>>
>>>> Ideally we should think about switching the implementation of this
>>>> transform to wrap HLLCount. However this could mean changing things in a
>>>> way which is not transparent to the end developer.  Although as a result of
>>>> the change they would get a better implementation for free on an upgrade :-)
>>>>
>>>> Another option would be to mark this transform as @Deprecated and
>>>> create a new transform ApproximateCountDistinct which would wrap
>>>> HLLCount. The name is also much clearer.
>>>>
>>>
>>> Marking it deprecated instead of changing its implementation will
>>> probably create a less surprising experience for the users.
>>>
>>>
>>>>
>>>> ApproximateDistinct.java
>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>
>>>> This transform does generate sketches as output and given its marked as
>>>> @Experimental, one option we would have is to create a name which includes
>>>> the algorithm implementation details, for example
>>>> ApproximateCountDistinctClearSpring.
>>>>
>>>
>>> Can we remove this, if it is both experimental and providing the same
>>> utility as HllCount? Is the concern that users might have stored sketches?
>>>
>>
>> I think this is the sort of thing that has been experimental forever, and
>> therefore not experimental (e.g. the entire triggering API is experimental
>> as are all our file-based sinks). I think that many users use this, and
>> probably store the state implicitly in streaming pipelines.
>>
>>
>>>
>>>>
>>>>
>>>> HllCount.java
>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>
>>>> .
>>>>
>>>> Again we have a few options here, as the name does not include search
>>>> words like Approximate, we can create a wrapper which has a name similar to
>>>> ApproximateCountDistinctZetaSketch.
>>>>
>>>
>>>>
>>>> Thoughts?
>>>>
>>>>
>>>>
>>>> Cheers
>>>>
>>>>
>>>>
>>>> Reza
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> This email may be confidential and privileged. If you received this
>>>> communication by mistake, please don't forward it to anyone else, please
>>>> erase all copies and attachments, and please let me know that it has gone
>>>> to the wrong person.
>>>>
>>>> The above terms reflect a potential business arrangement, are provided
>>>> solely as a basis for further discussion, and are not intended to be and do
>>>> not constitute a legally binding obligation. No legally binding obligations
>>>> will be created, implied, or inferred until an agreement in final form is
>>>> executed in writing by all parties involved.
>>>>
>>>

Re: Cleaning up Approximate Algorithms in Beam

Posted by Kenneth Knowles <ke...@apache.org>.
Wow. Nice summary, yes. Major calls to action:

0. Never allow a combiner that does not include the format of its state
clear in its name/URN. The "update compatibility" problem makes their
internal accumulator state essentially part of their public API. Combiners
named for what they do are an inherent risk, since we might have a new way
to do the same operation with different implementation-detail state. And
they will match search terms better, which is a major problem.

1. Point users to HllCount. This seems to be the best of the three. Does it
have a name that is clear enough about the format of its state? Noting that
its Java package name includes zetasketch, perhaps.

2. Deprecate the others, at least. And remove them from e.g. Javadoc.

Kenn

On Wed, Nov 13, 2019 at 10:01 AM Reuven Lax <re...@google.com> wrote:

>
>
> On Wed, Nov 13, 2019 at 9:58 AM Ahmet Altay <al...@google.com> wrote:
>
>> Thank you for writing this summary.
>>
>> On Tue, Nov 12, 2019 at 6:35 PM Reza Rokni <re...@google.com> wrote:
>>
>>> Hi everyone;
>>>
>>> TL/DR : Discussion on Beam's various Approximate Distinct Count
>>> algorithms.
>>>
>>> Today there are several options for Approximate Algorithms in Apache
>>> Beam 2.16 with HLLCount being the most recently added. Would like to canvas
>>> opinions here on the possibility of rationalizing these API's by removing
>>> obsolete / less efficient implementations.
>>> The current situation:
>>>
>>> There are three options available to users: ApproximateUnique.java
>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>,
>>> ApproximateDistinct.java
>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>> and HllCount.java
>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>.
>>> A quick summary of these API's as I understand them:
>>>
>>> HllCount.java
>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>:
>>> Marked as @Experimental
>>>
>>> PTransforms to compute HyperLogLogPlusPlus (HLL++) sketches on data
>>> streams based on the ZetaSketch <https://github.com/google/zetasketch>
>>> implementation.Detailed design of this class, see
>>> https://s.apache.org/hll-in-beam.
>>>
>>> ApproximateUnique.java
>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>:
>>> Not Marked with experimental
>>>
>>> This API does not expose the ability to create sketches so it's not
>>> suitable for the OLAP use case that HLL++ is geared towards (do
>>> pre-aggregation into sketches to allow interactive query speeds). It's also
>>> less precise for the same amount of memory used: the error bounds in the
>>> doc comments give :
>>>
>>> /* The error is about
>>>
>>> {@code 2 * / sqrt(sampleSize)},) */
>>>
>>> Compared to the default HLLCount sketch size, its error is 10X larger
>>> than the HLL++ error.
>>>
>>
>> FWIW, There is a python implementation only for this version:
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38
>>
>>
>>
>>> ApproximateDistinct.java
>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>> Marked with @Experimental
>>>
>>> This is a re-implementation of the HLL++ algorithm, based on the paper
>>> published in 2013. It is exposing sketches via a HyperLogLogPlusCoder. We
>>> have not run any benchmarks to compare this implementation compared to the
>>> HLLCount and we need to be careful to ensure that if we were to change any
>>> of these API's that the binary format of the sketches should never change,
>>> there could be users who have stored previous sketches using
>>> ApproximateDistinct and it will be important to try and ensure they do not
>>> have a bad experience.
>>>
>>>
>>> Proposals:
>>>
>>> There are two classes of users expected for these algorithms:
>>>
>>> 1) Users who simply use the transform to estimate the size of their data
>>> set in Beam
>>>
>>> 2) Users who want to create sketches and store them, either for
>>> interoperability with other systems, or as features to be used in further
>>> data processing.
>>>
>>>
>>>
>>> For use case 1, it is possible to make use of naming which does not
>>> expose the implementation, however for use case 2 it is important for the
>>> implementation to be explicit as sketches produced with one implementation
>>> will not work with other implementations.
>>>
>>> ApproximateUnique.java
>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>
>>> :
>>>
>>> This one does not provide sketches and based on the notes above, is not
>>> as efficient as HLLCount. However it does have a very searchable name and
>>> is likely to be the one users will gravitate to when searching for
>>> Approximate unique algorithms but do not need the capabilities of sketches.
>>>
>>> Ideally we should think about switching the implementation of this
>>> transform to wrap HLLCount. However this could mean changing things in a
>>> way which is not transparent to the end developer.  Although as a result of
>>> the change they would get a better implementation for free on an upgrade :-)
>>>
>>> Another option would be to mark this transform as @Deprecated and create
>>> a new transform ApproximateCountDistinct which would wrap HLLCount. The
>>> name is also much clearer.
>>>
>>
>> Marking it deprecated instead of changing its implementation will
>> probably create a less surprising experience for the users.
>>
>>
>>>
>>> ApproximateDistinct.java
>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>
>>> This transform does generate sketches as output and given its marked as
>>> @Experimental, one option we would have is to create a name which includes
>>> the algorithm implementation details, for example
>>> ApproximateCountDistinctClearSpring.
>>>
>>
>> Can we remove this, if it is both experimental and providing the same
>> utility as HllCount? Is the concern that users might have stored sketches?
>>
>
> I think this is the sort of thing that has been experimental forever, and
> therefore not experimental (e.g. the entire triggering API is experimental
> as are all our file-based sinks). I think that many users use this, and
> probably store the state implicitly in streaming pipelines.
>
>
>>
>>>
>>>
>>> HllCount.java
>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>
>>> .
>>>
>>> Again we have a few options here, as the name does not include search
>>> words like Approximate, we can create a wrapper which has a name similar to
>>> ApproximateCountDistinctZetaSketch.
>>>
>>
>>>
>>> Thoughts?
>>>
>>>
>>>
>>> Cheers
>>>
>>>
>>>
>>> Reza
>>>
>>>
>>>
>>>
>>>
>>> --
>>>
>>> This email may be confidential and privileged. If you received this
>>> communication by mistake, please don't forward it to anyone else, please
>>> erase all copies and attachments, and please let me know that it has gone
>>> to the wrong person.
>>>
>>> The above terms reflect a potential business arrangement, are provided
>>> solely as a basis for further discussion, and are not intended to be and do
>>> not constitute a legally binding obligation. No legally binding obligations
>>> will be created, implied, or inferred until an agreement in final form is
>>> executed in writing by all parties involved.
>>>
>>

Re: Cleaning up Approximate Algorithms in Beam

Posted by Reuven Lax <re...@google.com>.
On Wed, Nov 13, 2019 at 9:58 AM Ahmet Altay <al...@google.com> wrote:

> Thank you for writing this summary.
>
> On Tue, Nov 12, 2019 at 6:35 PM Reza Rokni <re...@google.com> wrote:
>
>> Hi everyone;
>>
>> TL/DR : Discussion on Beam's various Approximate Distinct Count
>> algorithms.
>>
>> Today there are several options for Approximate Algorithms in Apache Beam
>> 2.16 with HLLCount being the most recently added. Would like to canvas
>> opinions here on the possibility of rationalizing these API's by removing
>> obsolete / less efficient implementations.
>> The current situation:
>>
>> There are three options available to users: ApproximateUnique.java
>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>,
>> ApproximateDistinct.java
>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>> and HllCount.java
>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>.
>> A quick summary of these API's as I understand them:
>>
>> HllCount.java
>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>:
>> Marked as @Experimental
>>
>> PTransforms to compute HyperLogLogPlusPlus (HLL++) sketches on data
>> streams based on the ZetaSketch <https://github.com/google/zetasketch>
>> implementation.Detailed design of this class, see
>> https://s.apache.org/hll-in-beam.
>>
>> ApproximateUnique.java
>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>:
>> Not Marked with experimental
>>
>> This API does not expose the ability to create sketches so it's not
>> suitable for the OLAP use case that HLL++ is geared towards (do
>> pre-aggregation into sketches to allow interactive query speeds). It's also
>> less precise for the same amount of memory used: the error bounds in the
>> doc comments give :
>>
>> /* The error is about
>>
>> {@code 2 * / sqrt(sampleSize)},) */
>>
>> Compared to the default HLLCount sketch size, its error is 10X larger
>> than the HLL++ error.
>>
>
> FWIW, There is a python implementation only for this version:
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38
>
>
>
>> ApproximateDistinct.java
>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>> Marked with @Experimental
>>
>> This is a re-implementation of the HLL++ algorithm, based on the paper
>> published in 2013. It is exposing sketches via a HyperLogLogPlusCoder. We
>> have not run any benchmarks to compare this implementation compared to the
>> HLLCount and we need to be careful to ensure that if we were to change any
>> of these API's that the binary format of the sketches should never change,
>> there could be users who have stored previous sketches using
>> ApproximateDistinct and it will be important to try and ensure they do not
>> have a bad experience.
>>
>>
>> Proposals:
>>
>> There are two classes of users expected for these algorithms:
>>
>> 1) Users who simply use the transform to estimate the size of their data
>> set in Beam
>>
>> 2) Users who want to create sketches and store them, either for
>> interoperability with other systems, or as features to be used in further
>> data processing.
>>
>>
>>
>> For use case 1, it is possible to make use of naming which does not
>> expose the implementation, however for use case 2 it is important for the
>> implementation to be explicit as sketches produced with one implementation
>> will not work with other implementations.
>>
>> ApproximateUnique.java
>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>
>> :
>>
>> This one does not provide sketches and based on the notes above, is not
>> as efficient as HLLCount. However it does have a very searchable name and
>> is likely to be the one users will gravitate to when searching for
>> Approximate unique algorithms but do not need the capabilities of sketches.
>>
>> Ideally we should think about switching the implementation of this
>> transform to wrap HLLCount. However this could mean changing things in a
>> way which is not transparent to the end developer.  Although as a result of
>> the change they would get a better implementation for free on an upgrade :-)
>>
>> Another option would be to mark this transform as @Deprecated and create
>> a new transform ApproximateCountDistinct which would wrap HLLCount. The
>> name is also much clearer.
>>
>
> Marking it deprecated instead of changing its implementation will probably
> create a less surprising experience for the users.
>
>
>>
>> ApproximateDistinct.java
>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>
>> This transform does generate sketches as output and given its marked as
>> @Experimental, one option we would have is to create a name which includes
>> the algorithm implementation details, for example
>> ApproximateCountDistinctClearSpring.
>>
>
> Can we remove this, if it is both experimental and providing the same
> utility as HllCount? Is the concern that users might have stored sketches?
>

I think this is the sort of thing that has been experimental forever, and
therefore not experimental (e.g. the entire triggering API is experimental
as are all our file-based sinks). I think that many users use this, and
probably store the state implicitly in streaming pipelines.


>
>>
>>
>> HllCount.java
>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>
>> .
>>
>> Again we have a few options here, as the name does not include search
>> words like Approximate, we can create a wrapper which has a name similar to
>> ApproximateCountDistinctZetaSketch.
>>
>
>>
>> Thoughts?
>>
>>
>>
>> Cheers
>>
>>
>>
>> Reza
>>
>>
>>
>>
>>
>> --
>>
>> This email may be confidential and privileged. If you received this
>> communication by mistake, please don't forward it to anyone else, please
>> erase all copies and attachments, and please let me know that it has gone
>> to the wrong person.
>>
>> The above terms reflect a potential business arrangement, are provided
>> solely as a basis for further discussion, and are not intended to be and do
>> not constitute a legally binding obligation. No legally binding obligations
>> will be created, implied, or inferred until an agreement in final form is
>> executed in writing by all parties involved.
>>
>

Re: Cleaning up Approximate Algorithms in Beam

Posted by Ahmet Altay <al...@google.com>.
Thank you for writing this summary.

On Tue, Nov 12, 2019 at 6:35 PM Reza Rokni <re...@google.com> wrote:

> Hi everyone;
>
> TL/DR : Discussion on Beam's various Approximate Distinct Count algorithms.
>
> Today there are several options for Approximate Algorithms in Apache Beam
> 2.16 with HLLCount being the most recently added. Would like to canvas
> opinions here on the possibility of rationalizing these API's by removing
> obsolete / less efficient implementations.
> The current situation:
>
> There are three options available to users: ApproximateUnique.java
> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>,
> ApproximateDistinct.java
> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
> and HllCount.java
> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>.
> A quick summary of these API's as I understand them:
>
> HllCount.java
> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>:
> Marked as @Experimental
>
> PTransforms to compute HyperLogLogPlusPlus (HLL++) sketches on data
> streams based on the ZetaSketch <https://github.com/google/zetasketch>
> implementation.Detailed design of this class, see
> https://s.apache.org/hll-in-beam.
>
> ApproximateUnique.java
> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>:
> Not Marked with experimental
>
> This API does not expose the ability to create sketches so it's not
> suitable for the OLAP use case that HLL++ is geared towards (do
> pre-aggregation into sketches to allow interactive query speeds). It's also
> less precise for the same amount of memory used: the error bounds in the
> doc comments give :
>
> /* The error is about
>
> {@code 2 * / sqrt(sampleSize)},) */
>
> Compared to the default HLLCount sketch size, its error is 10X larger than
> the HLL++ error.
>

FWIW, There is a python implementation only for this version:
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38



> ApproximateDistinct.java
> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
> Marked with @Experimental
>
> This is a re-implementation of the HLL++ algorithm, based on the paper
> published in 2013. It is exposing sketches via a HyperLogLogPlusCoder. We
> have not run any benchmarks to compare this implementation compared to the
> HLLCount and we need to be careful to ensure that if we were to change any
> of these API's that the binary format of the sketches should never change,
> there could be users who have stored previous sketches using
> ApproximateDistinct and it will be important to try and ensure they do not
> have a bad experience.
>
>
> Proposals:
>
> There are two classes of users expected for these algorithms:
>
> 1) Users who simply use the transform to estimate the size of their data
> set in Beam
>
> 2) Users who want to create sketches and store them, either for
> interoperability with other systems, or as features to be used in further
> data processing.
>
>
>
> For use case 1, it is possible to make use of naming which does not expose
> the implementation, however for use case 2 it is important for the
> implementation to be explicit as sketches produced with one implementation
> will not work with other implementations.
>
> ApproximateUnique.java
> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>
> :
>
> This one does not provide sketches and based on the notes above, is not as
> efficient as HLLCount. However it does have a very searchable name and is
> likely to be the one users will gravitate to when searching for Approximate
> unique algorithms but do not need the capabilities of sketches.
>
> Ideally we should think about switching the implementation of this
> transform to wrap HLLCount. However this could mean changing things in a
> way which is not transparent to the end developer.  Although as a result of
> the change they would get a better implementation for free on an upgrade :-)
>
> Another option would be to mark this transform as @Deprecated and create a
> new transform ApproximateCountDistinct which would wrap HLLCount. The
> name is also much clearer.
>

Marking it deprecated instead of changing its implementation will probably
create a less surprising experience for the users.


>
> ApproximateDistinct.java
> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>
> This transform does generate sketches as output and given its marked as
> @Experimental, one option we would have is to create a name which includes
> the algorithm implementation details, for example
> ApproximateCountDistinctClearSpring.
>

Can we remove this, if it is both experimental and providing the same
utility as HllCount? Is the concern that users might have stored sketches?


>
>
> HllCount.java
> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>
> .
>
> Again we have a few options here, as the name does not include search
> words like Approximate, we can create a wrapper which has a name similar to
> ApproximateCountDistinctZetaSketch.
>

>
> Thoughts?
>
>
>
> Cheers
>
>
>
> Reza
>
>
>
>
>
> --
>
> This email may be confidential and privileged. If you received this
> communication by mistake, please don't forward it to anyone else, please
> erase all copies and attachments, and please let me know that it has gone
> to the wrong person.
>
> The above terms reflect a potential business arrangement, are provided
> solely as a basis for further discussion, and are not intended to be and do
> not constitute a legally binding obligation. No legally binding obligations
> will be created, implied, or inferred until an agreement in final form is
> executed in writing by all parties involved.
>