You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Reza Rokni <re...@google.com> on 2020/02/19 00:31:34 UTC
Re: Cleaning up Approximate Algorithms in Beam

Hi,

I will be making time for completing this in March, with completion and
reviews planned for April.

Cheers

Reza

On Wed, 27 Nov 2019 at 02:29, Robert Bradshaw <ro...@google.com> wrote:

> I think this thread is sufficient.
>
> On Mon, Nov 25, 2019 at 5:59 PM Reza Rokni <re...@google.com> wrote:
>
>> Hi,
>>
>> So do we need a vote for the final list of actions? Or is this thread
>> enough to go ahead and raise the PR's?
>>
>> Cheers
>>
>> Reza
>>
>> On Tue, 26 Nov 2019 at 06:01, Ahmet Altay <al...@google.com> wrote:
>>
>>>
>>>
>>> On Mon, Nov 18, 2019 at 10:57 AM Robert Bradshaw <ro...@google.com>
>>> wrote:
>>>
>>>> On Sun, Nov 17, 2019 at 5:16 PM Reza Rokni <re...@google.com> wrote:
>>>>
>>>>> *Ahmet: FWIW, There is a python implementation only for this
>>>>> version: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38
>>>>> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38> *
>>>>> Eventually we will be able to make use of cross language transforms to
>>>>> help with feature parity. Until then, are we ok with marking this
>>>>> deprecated in python, even though we do not have another solution. Or leave
>>>>> it as is in Python now, as it does not have sketch capability so can only
>>>>> be used for outputting results directly from the pipeline.
>>>>>
>>>>
>>> If it is our intention to add the capability eventually, IMO it makes
>>> sense to mark the existing functionality deprecated in Python as well.
>>>
>>>
>>>>> *Reuven: I think this is the sort of thing that has been experimental
>>>>> forever, and therefore not experimental (e.g. the entire triggering API is
>>>>> experimental as are all our file-based sinks). I think that many users use
>>>>> this, and probably store the state implicitly in streaming pipelines.*
>>>>> True, I have an old action item to try and go through and PR against
>>>>> old @experimental annotations but need to find time. So for this
>>>>> discussion; I guess this should be marked as deprecated if we change it
>>>>> even though its @experimental.
>>>>>
>>>>
>>>> Agreed.
>>>>
>>>>
>>>>> *Rob: I'm not following this--by naming things after their
>>>>> implementation rather than their intent I think they will be harder to
>>>>> search for. *
>>>>> This is to add to the name the implementation, after the intent. For
>>>>> example ApproximateCountDistinctZetaSketch, I believe should be easy to
>>>>> search for and it is clear which implementation is used. Allowing for a
>>>>> potentially better implementation ApproximateCountDistinct<FooBar>.
>>>>>
>>>>
>>>> OK, if we have both I'm more OK with that. This is better than the
>>>> names like HllCount, which seems to be what was suggested.
>>>>
>>>> Another approach would be to have a required  parameter which is an
>>>>> enum of the implementation options.
>>>>> ApproximateCountDistinct.of().usingImpl(ZETA) ?
>>>>>
>>>>
>>>> Ideally this could be an optional parameter, or possibly only required
>>>> during update until we figure out a good way for the runner to plug this in
>>>> appropreately.
>>>>
>>>> Rob/Kenn: On Combiner discussion, should we tie action items from the
>>>>> needs of this thread to this larger discussion?
>>>>>
>>>>> Cheers
>>>>> Reza
>>>>>
>>>>> On Fri, 15 Nov 2019 at 08:32, Robert Bradshaw <ro...@google.com>
>>>>> wrote:
>>>>>
>>>>>> On Thu, Nov 14, 2019 at 1:06 AM Kenneth Knowles <ke...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Wow. Nice summary, yes. Major calls to action:
>>>>>>>
>>>>>>> 0. Never allow a combiner that does not include the format of its
>>>>>>> state clear in its name/URN. The "update compatibility" problem makes their
>>>>>>> internal accumulator state essentially part of their public API. Combiners
>>>>>>> named for what they do are an inherent risk, since we might have a new way
>>>>>>> to do the same operation with different implementation-detail state.
>>>>>>>
>>>>>>
>>>>>> It seems this will make for a worse user experience, motivated solely
>>>>>> by limitations in our implementation. I think we can do better.
>>>>>> Hypothetical idea: what if upgrade required access to the original graph
>>>>>> (or at least metadata about it) during construction? In this case an
>>>>>> ApproximateDistinct could look at what was used last time and try to do the
>>>>>> same, but be free to do something better when unconstrained. Another
>>>>>> approach would be to encode several alternative expansions in the Beam
>>>>>> graph and let the runner do the picking (based on prior submission).
>>>>>> (Making the CombineFn, as opposed to the composite, have several
>>>>>> alternatives seems harder to reason about, but maybe worth pursuing as
>>>>>> well).
>>>>>>
>>>>>> This is not unique to Combiners, but any stateful DoFn, or composite
>>>>>> operations with non-trivial internal structure (and coders). This has been
>>>>>> discussed a lot, perhaps there are some ideas there we could borrow?
>>>>>>
>>>>>> And they will match search terms better, which is a major problem.
>>>>>>>
>>>>>>
>>>>>> I'm not following this--by naming things after their implementation
>>>>>> rather than their intent I think they will be harder to search for.
>>>>>>
>>>>>>
>>>>>>> 1. Point users to HllCount. This seems to be the best of the three.
>>>>>>> Does it have a name that is clear enough about the format of its state?
>>>>>>> Noting that its Java package name includes zetasketch, perhaps.
>>>>>>> 2. Deprecate the others, at least. And remove them from e.g. Javadoc.
>>>>>>>
>>>>>>
>>>>>> +1
>>>>>>
>>>>>>
>>>>>>> On Wed, Nov 13, 2019 at 10:01 AM Reuven Lax <re...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Nov 13, 2019 at 9:58 AM Ahmet Altay <al...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thank you for writing this summary.
>>>>>>>>>
>>>>>>>>> On Tue, Nov 12, 2019 at 6:35 PM Reza Rokni <re...@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi everyone;
>>>>>>>>>>
>>>>>>>>>> TL/DR : Discussion on Beam's various Approximate Distinct Count
>>>>>>>>>> algorithms.
>>>>>>>>>>
>>>>>>>>>> Today there are several options for Approximate Algorithms in
>>>>>>>>>> Apache Beam 2.16 with HLLCount being the most recently added. Would like to
>>>>>>>>>> canvas opinions here on the possibility of rationalizing these API's by
>>>>>>>>>> removing obsolete / less efficient implementations.
>>>>>>>>>> The current situation:
>>>>>>>>>>
>>>>>>>>>> There are three options available to users:
>>>>>>>>>> ApproximateUnique.java
>>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>,
>>>>>>>>>> ApproximateDistinct.java
>>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>>>>>> and HllCount.java
>>>>>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>.
>>>>>>>>>> A quick summary of these API's as I understand them:
>>>>>>>>>>
>>>>>>>>>> HllCount.java
>>>>>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>:
>>>>>>>>>> Marked as @Experimental
>>>>>>>>>>
>>>>>>>>>> PTransforms to compute HyperLogLogPlusPlus (HLL++) sketches on
>>>>>>>>>> data streams based on the ZetaSketch
>>>>>>>>>> <https://github.com/google/zetasketch> implementation.Detailed
>>>>>>>>>> design of this class, see https://s.apache.org/hll-in-beam.
>>>>>>>>>>
>>>>>>>>>> ApproximateUnique.java
>>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>:
>>>>>>>>>> Not Marked with experimental
>>>>>>>>>>
>>>>>>>>>> This API does not expose the ability to create sketches so it's
>>>>>>>>>> not suitable for the OLAP use case that HLL++ is geared towards (do
>>>>>>>>>> pre-aggregation into sketches to allow interactive query speeds). It's also
>>>>>>>>>> less precise for the same amount of memory used: the error bounds in the
>>>>>>>>>> doc comments give :
>>>>>>>>>>
>>>>>>>>>> /* The error is about
>>>>>>>>>>
>>>>>>>>>> {@code 2 * / sqrt(sampleSize)},) */
>>>>>>>>>>
>>>>>>>>>> Compared to the default HLLCount sketch size, its error is 10X
>>>>>>>>>> larger than the HLL++ error.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> FWIW, There is a python implementation only for this version:
>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> ApproximateDistinct.java
>>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>>>>>> Marked with @Experimental
>>>>>>>>>>
>>>>>>>>>> This is a re-implementation of the HLL++ algorithm, based on the
>>>>>>>>>> paper published in 2013. It is exposing sketches via a
>>>>>>>>>> HyperLogLogPlusCoder. We have not run any benchmarks to compare this
>>>>>>>>>> implementation compared to the HLLCount and we need to be careful to ensure
>>>>>>>>>> that if we were to change any of these API's that the binary format of the
>>>>>>>>>> sketches should never change, there could be users who have stored previous
>>>>>>>>>> sketches using ApproximateDistinct and it will be important to try and
>>>>>>>>>> ensure they do not have a bad experience.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Proposals:
>>>>>>>>>>
>>>>>>>>>> There are two classes of users expected for these algorithms:
>>>>>>>>>>
>>>>>>>>>> 1) Users who simply use the transform to estimate the size of
>>>>>>>>>> their data set in Beam
>>>>>>>>>>
>>>>>>>>>> 2) Users who want to create sketches and store them, either for
>>>>>>>>>> interoperability with other systems, or as features to be used in further
>>>>>>>>>> data processing.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> For use case 1, it is possible to make use of naming which does
>>>>>>>>>> not expose the implementation, however for use case 2 it is important for
>>>>>>>>>> the implementation to be explicit as sketches produced with one
>>>>>>>>>> implementation will not work with other implementations.
>>>>>>>>>>
>>>>>>>>>> ApproximateUnique.java
>>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java>
>>>>>>>>>> :
>>>>>>>>>>
>>>>>>>>>> This one does not provide sketches and based on the notes above,
>>>>>>>>>> is not as efficient as HLLCount. However it does have a very searchable
>>>>>>>>>> name and is likely to be the one users will gravitate to when searching for
>>>>>>>>>> Approximate unique algorithms but do not need the capabilities of sketches.
>>>>>>>>>>
>>>>>>>>>> Ideally we should think about switching the implementation of
>>>>>>>>>> this transform to wrap HLLCount. However this could mean changing things in
>>>>>>>>>> a way which is not transparent to the end developer.  Although as a result
>>>>>>>>>> of the change they would get a better implementation for free on an upgrade
>>>>>>>>>> :-)
>>>>>>>>>>
>>>>>>>>>> Another option would be to mark this transform as @Deprecated and
>>>>>>>>>> create a new transform ApproximateCountDistinct which would wrap
>>>>>>>>>> HLLCount. The name is also much clearer.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Marking it deprecated instead of changing its implementation will
>>>>>>>>> probably create a less surprising experience for the users.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ApproximateDistinct.java
>>>>>>>>>> <https://github.com/apache/beam/blob/058f5bc1d03e207ce5c8b9d221a231f390b088f3/sdks/java/extensions/sketching/src/main/java/org/apache/beam/sdk/extensions/sketching/ApproximateDistinct.java>
>>>>>>>>>>
>>>>>>>>>> This transform does generate sketches as output and given its
>>>>>>>>>> marked as @Experimental, one option we would have is to create a name which
>>>>>>>>>> includes the algorithm implementation details, for example
>>>>>>>>>> ApproximateCountDistinctClearSpring.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Can we remove this, if it is both experimental and providing the
>>>>>>>>> same utility as HllCount? Is the concern that users might have stored
>>>>>>>>> sketches?
>>>>>>>>>
>>>>>>>>
>>>>>>>> I think this is the sort of thing that has been experimental
>>>>>>>> forever, and therefore not experimental (e.g. the entire triggering API is
>>>>>>>> experimental as are all our file-based sinks). I think that many users use
>>>>>>>> this, and probably store the state implicitly in streaming pipelines.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> HllCount.java
>>>>>>>>>> <https://beam.apache.org/releases/javadoc/2.16.0/org/apache/beam/sdk/extensions/zetasketch/HllCount.html>
>>>>>>>>>> .
>>>>>>>>>>
>>>>>>>>>> Again we have a few options here, as the name does not include
>>>>>>>>>> search words like Approximate, we can create a wrapper which has a name
>>>>>>>>>> similar to ApproximateCountDistinctZetaSketch.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thoughts?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Reza
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>>> This email may be confidential and privileged. If you received
>>>>>>>>>> this communication by mistake, please don't forward it to anyone else,
>>>>>>>>>> please erase all copies and attachments, and please let me know that it has
>>>>>>>>>> gone to the wrong person.
>>>>>>>>>>
>>>>>>>>>> The above terms reflect a potential business arrangement, are
>>>>>>>>>> provided solely as a basis for further discussion, and are not intended to
>>>>>>>>>> be and do not constitute a legally binding obligation. No legally binding
>>>>>>>>>> obligations will be created, implied, or inferred until an agreement in
>>>>>>>>>> final form is executed in writing by all parties involved.
>>>>>>>>>>
>>>>>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> This email may be confidential and privileged. If you received this
>>>>> communication by mistake, please don't forward it to anyone else, please
>>>>> erase all copies and attachments, and please let me know that it has gone
>>>>> to the wrong person.
>>>>>
>>>>> The above terms reflect a potential business arrangement, are provided
>>>>> solely as a basis for further discussion, and are not intended to be and do
>>>>> not constitute a legally binding obligation. No legally binding obligations
>>>>> will be created, implied, or inferred until an agreement in final form is
>>>>> executed in writing by all parties involved.
>>>>>
>>>>
>>
>> --
>>
>> This email may be confidential and privileged. If you received this
>> communication by mistake, please don't forward it to anyone else, please
>> erase all copies and attachments, and please let me know that it has gone
>> to the wrong person.
>>
>> The above terms reflect a potential business arrangement, are provided
>> solely as a basis for further discussion, and are not intended to be and do
>> not constitute a legally binding obligation. No legally binding obligations
>> will be created, implied, or inferred until an agreement in final form is
>> executed in writing by all parties involved.
>>
>