You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@beam.apache.org by Sunil Pedapudi <sk...@google.com> on 2021/02/02 07:50:02 UTC

Re: PTransform Annotations Proposal

Thanks very much, Mirac. I added a few comments as well, and the proposal looks good to me. Thanks for all the discussion.

On 2021/01/14 20:24:19, Robert Bradshaw <ro...@google.com> wrote: 
> Thanks. I added some comments on the proposal.
> 
> On Wed, Jan 13, 2021 at 9:21 AM Mirac Vuslat Basaran <mi...@google.com>
> wrote:
> 
> > Wrote a design draft for resource-related annotations. Please have a look:
> > https://docs.google.com/document/d/1phExeGD1gdDI9M8LK4ZG57UGa7dswpB8Aj6jxWj4uQk/edit?usp=sharing
> >
> > Cheers,
> > Mirac
> > On 2020/11/26 20:20:09, Mirac Vuslat Basaran <mi...@google.com> wrote:
> > > Created a PR without unit tests at
> > https://github.com/apache/beam/pull/13434. Please have a look.
> > >
> > > Thanks,
> > > Mirac
> > >
> > >
> > > On 2020/11/25 18:50:19, Robert Burke <ro...@frantil.com> wrote:
> > > > Hmmm. Fair. I'm mostly concerned about the pathological case where we
> > end
> > > > up with a distinct Environment per transform, but there are likely
> > > > practical cases where that's reasonable (High mem to GPU to TPU, to
> > ARM....)
> > > >
> > > > On Wed, Nov 25, 2020, 10:42 AM Robert Bradshaw <ro...@google.com>
> > wrote:
> > > >
> > > > > I'd like to continue the discussion *and* see an implementation for
> > > > > the part we've settled on. I was asking why not have "every distinct
> > > > > physical concern means a distinct environment?"
> > > > >
> > > > > On Wed, Nov 25, 2020 at 10:38 AM Robert Burke <ro...@frantil.com>
> > wrote:
> > > > > >
> > > > > > Mostly because perfect is the enemy of good enough. We have a
> > proposal,
> > > > > we have clear boundaries for it. It's fine if the discussion
> > continues, but
> > > > > I see no evidence of concerns that should prevent starting an
> > > > > implementation, because it seems we'll need both anyway.
> > > > > >
> > > > > > On Wed, Nov 25, 2020, 10:25 AM Robert Bradshaw <
> > robertwb@google.com>
> > > > > wrote:
> > > > > >>
> > > > > >> On Wed, Nov 25, 2020 at 10:15 AM Robert Burke <robert@frantil.com
> > >
> > > > > wrote:
> > > > > >> >
> > > > > >> > It sounds like we've come to the position that non-correctness
> > > > > affecting Ptransform Annotations are valuable at both leaf and
> > composite
> > > > > levels, and don't remove the potential need for a similar feature on
> > > > > Environments, to handle physical concerns equirements for worker
> > processes
> > > > > to have (such as Ram, CPU, or GPU requirements.)
> > > > > >> >
> > > > > >> > Kenn, it's not clear what part of the solution (an annotation
> > field
> > > > > on the Ptransform proto message) would need to change to satisfy
> > your scope
> > > > > concern, beyond documenting unambiguously that these may not be used
> > for
> > > > > physical concerns or things that affect correctness.
> > > > > >>
> > > > > >> I'll let Kenn answer as well, but from my point of view,
> > explicitly
> > > > > >> having somewhere better to put these things would help.
> > > > > >>
> > > > > >> > I'm also unclear your scope concern not matching, given the
> > above.
> > > > > Your first paragraph reads very supportive of logical annotations on
> > > > > Ptransforms, and that matches 1-1 with the current proposed
> > solution. Can
> > > > > you clarify your concern?
> > > > > >> >
> > > > > >> > I don't wish to scope creep on the physical requirements issue
> > at
> > > > > this time. It seems we are agreed they should end up on
> > environments, but
> > > > > I'm not seeing proposals on the right way to execute them at this
> > time.They
> > > > > seem to be a fruitful topic of discussion, in particular
> > > > > unifying/consolidating them for efficient use of resources. I don't
> > think
> > > > > we want to end up in a state where every distinct physical concern
> > means a
> > > > > distinct environment.
> > > > > >>
> > > > > >> Why not? Assuming, of course, that runners are free to merge
> > > > > >> environments (merging those resource hints they understand and are
> > > > > >> otherwise compatible, and discarding those they don't) for
> > efficient
> > > > > >> execution.
> > > > > >>
> > > > > >> > I for one am ready to see a PR.
> > > > > >>
> > > > > >> +1
> > > > > >>
> > > > > >> > On Mon, Nov 23, 2020, 6:02 PM Kenneth Knowles <ke...@apache.org>
> > > > > wrote:
> > > > > >> >>
> > > > > >> >>
> > > > > >> >>
> > > > > >> >> On Mon, Nov 23, 2020 at 3:04 PM Robert Bradshaw <
> > robertwb@google.com>
> > > > > wrote:
> > > > > >> >>>
> > > > > >> >>> On Fri, Nov 20, 2020 at 11:08 AM Mirac Vuslat Basaran <
> > > > > miracb@google.com> wrote:
> > > > > >> >>> >
> > > > > >> >>> > Thanks everyone so much for their input and for the
> > insightful
> > > > > discussion.
> > > > > >> >>> >
> > > > > >> >>> > Not being knowledgeable about Beam's internals, I have to
> > say I
> > > > > am a bit lost on the PTransform vs. environment discussion.
> > > > > >> >>> >
> > > > > >> >>> > I do agree with Burke's notion that merge rules are very
> > > > > annotation dependent, I don't think we can find a one-size-fits-all
> > > > > solution for that. So this might be actually be an argument in
> > favour of
> > > > > having annotations on PTransforms, since it avoids the conflation
> > with
> > > > > environments.
> > > > > >> >>> >
> > > > > >> >>> > Also in general, I feel that having annotations per single
> > > > > transform (rather than composite) and on PTransforms could lead to a
> > > > > simpler design.
> > > > > >> >>>
> > > > > >> >>> If we want to use these for privacy, I don't see how
> > attaching them
> > > > > to
> > > > > >> >>> leaf transforms alone could work. (Even CombinePerKey is a
> > > > > composite.)
> > > > > >> >>>
> > > > > >> >>> > Seeing as there are valuable arguments in favour of both
> > > > > (PTransform and environments) with no clear(?) "best solution", I
> > would
> > > > > propose moving forward with the initial (PTransform) design to ship
> > the
> > > > > feature and unblock teams asking for it. If it turns out that there
> > was
> > > > > indeed a need to have annotations in environments, we could always
> > refactor
> > > > > it.
> > > > > >> >>>
> > > > > >> >>> I have yet to see any arguments that resource-level hints,
> > such as
> > > > > >> >>> memory or GPU, don't better belong on the environment. But
> > moving
> > > > > >> >>> forward on PTransform-level ones for logical statements like
> > privacy
> > > > > >> >>> declarations makes sense.
> > > > > >> >>
> > > > > >> >>
> > > > > >> >> Exactly this. Properties of transforms make sense. The
> > properties
> > > > > may hold only of the whole subgraph. Even something as simple as
> > "preserves
> > > > > keys". This is analogous (but converse) to requirements like
> > "requires
> > > > > sorted input" which were explicitly excluded from the scope, which
> > was
> > > > > about hardware environment for execution.
> > > > > >> >>
> > > > > >> >> The proposed scope and the proposed solution do not match and
> > need
> > > > > to be reconciled.
> > > > > >> >>
> > > > > >> >> Kenn
> > > > > >> >>
> > > > > >> >>> > On 2020/11/17 19:07:22, Robert Bradshaw <
> > robertwb@google.com>
> > > > > wrote:
> > > > > >> >>> > > So far we have two distinct usecases for annotations:
> > resource
> > > > > hints
> > > > > >> >>> > > and privacy directives, and I've been trying to figure
> > out how
> > > > > to
> > > > > >> >>> > > reconcile them, but they seem to have very different
> > > > > characteristics.
> > > > > >> >>> > > (It would be nice to come up with other uses as well to
> > see if
> > > > > we're
> > > > > >> >>> > > really coming up with a generally useful mode--I think
> > display
> > > > > data
> > > > > >> >>> > > could fit into this as a new kind of annotation rather
> > than
> > > > > being a
> > > > > >> >>> > > top-level property, and it could make sense on both leaf
> > and
> > > > > composite
> > > > > >> >>> > > transforms.)
> > > > > >> >>> > >
> > > > > >> >>> > > To me, resource hints like GPU are inextricably tied to
> > the
> > > > > >> >>> > > environment. A transform tagged with GPU should reference
> > a Fn
> > > > > that
> > > > > >> >>> > > invokes GPU-accelerated code that lives in a particular
> > > > > environment.
> > > > > >> >>> > > Something like high-mem is a bit squishier. Some DoFns
> > take a
> > > > > lot of
> > > > > >> >>> > > memory, but on the other hand one could imagine labeling a
> > > > > CoGBK as
> > > > > >> >>> > > high-mem due to knowing that, in this particular usage,
> > there
> > > > > will be
> > > > > >> >>> > > lots of values with the same key. Ideally runners would be
> > > > > intelligent
> > > > > >> >>> > > enough to automatically learn memory usage, but even in
> > this
> > > > > case it
> > > > > >> >>> > > may be a good hint to try and learn the requirements for
> > DoFn A
> > > > > and
> > > > > >> >>> > > DoFn B separately (which is difficult if they are always
> > > > > colocated,
> > > > > >> >>> > > but valuable if, e.g. A takes a huge amount of memory and
> > B
> > > > > takes a
> > > > > >> >>> > > huge amount of wall time).
> > > > > >> >>> > >
> > > > > >> >>> > > Note that tying things to the environment does not
> > preclude
> > > > > using them
> > > > > >> >>> > > in non-portable runners as they'll still have an SDK-level
> > > > > >> >>> > > representation (though I don't think we should have an
> > explicit
> > > > > goal
> > > > > >> >>> > > of feature parity for non-portable runners, e.g.
> > multi-language
> > > > > isn't
> > > > > >> >>> > > happening, and hope that non-portable runners go away soon
> > > > > anyway).
> > > > > >> >>> > >
> > > > > >> >>> > > Now let's consider privacy annotations. To make things
> > very
> > > > > concrete,
> > > > > >> >>> > > imagine a transform AverageSpendPerZipCode which takes as
> > input
> > > > > (user,
> > > > > >> >>> > > zip, spend), all users unique, and returns (zip,
> > avg(spend)). In
> > > > > >> >>> > > Python, this is GroupBy('zip').aggregate_field('spend',
> > > > > >> >>> > > MeanCombineFn()). This is not very privacy preserving to
> > those
> > > > > users
> > > > > >> >>> > > who are the only (or one of a few) in a zip code. So we
> > could
> > > > > define a
> > > > > >> >>> > > transform PrivacyPreservingAverageSpendPerZipCode as
> > > > > >> >>> > >
> > > > > >> >>> > > @ptransform_fn
> > > > > >> >>> > > def
> > PrivacyPreservingAverageSpendPerZipCode(spend_per_user,
> > > > > threshold)
> > > > > >> >>> > >     counts_per_zip = spend_per_user |
> > > > > >> >>> > > GroupBy('zip').aggregate_field('user', CountCombineFn())
> > > > > >> >>> > >     spend_per_zip = spend_per_user |
> > > > > >> >>> > > GroupBy('zip').aggregate_field('spend', MeanCombineFn())
> > > > > >> >>> > >     filtered = spend_per_zip | beam.Filter(
> > > > > >> >>> > >         lambda x, counts: counts[x.zip] > threshold,
> > > > > >> >>> > > counts=AsMap(counts_per_zip))
> > > > > >> >>> > >     return filtered
> > > > > >> >>> > >
> > > > > >> >>> > > We now have a composite that has privacy preserving
> > properties
> > > > > (i.e.
> > > > > >> >>> > > the input may be quite sensitive, but the output is not,
> > > > > depending on
> > > > > >> >>> > > the value of threshold). What is interesting here is that
> > it is
> > > > > only
> > > > > >> >>> > > the composite that has this property--no individual
> > > > > sub-transform is
> > > > > >> >>> > > itself privacy preserving. Furthermore, an optimizer may
> > notice
> > > > > we're
> > > > > >> >>> > > doing aggregation on the same key twice and rewrite this
> > using
> > > > > >> >>> > > (logically)
> > > > > >> >>> > >
> > > > > >> >>> > >     GroupBy('zip').aggregate_field('user',
> > > > > >> >>> > > CountCombineFn()).aggregate_field('spend',
> > MeanCombineFn())
> > > > > >> >>> > >
> > > > > >> >>> > > and then applying the filter, which is semantically
> > equivalent
> > > > > and
> > > > > >> >>> > > satisfies the privacy annotations (and notably that does
> > not
> > > > > even
> > > > > >> >>> > > require the optimizer to interpret the annotations, just
> > pass
> > > > > them
> > > > > >> >>> > > on). To me, this implies that these annotations belong on
> > the
> > > > > >> >>> > > composites, and *not* on the leaf nodes (where they would
> > be
> > > > > >> >>> > > incorrect).
> > > > > >> >>> > >
> > > > > >> >>> > > I'll leave aside most questions of API until we figure
> > out the
> > > > > model
> > > > > >> >>> > > semantics, but wanted to throw one possible idea out
> > (though I
> > > > > am
> > > > > >> >>> > > ambivalent about it). Instead of attaching things to
> > > > > transforms, we
> > > > > >> >>> > > can just wrap transforms in composites that have no role
> > other
> > > > > than
> > > > > >> >>> > > declaring information about their contents. E.g. we could
> > have a
> > > > > >> >>> > > composite transform whose payload is simply an assertion
> > of the
> > > > > >> >>> > > privacy (or resource?) properties of its inner structure.
> > This
> > > > > would
> > > > > >> >>> > > be just as expressive as adding new properties to
> > transforms
> > > > > >> >>> > > themselves (but would add an extra level of nesting, and
> > make
> > > > > >> >>> > > respecting the precice nesting more important).
> > > > > >> >>> > >
> > > > > >> >>> > > On Tue, Nov 17, 2020 at 8:12 AM Robert Burke <
> > > > > robert@frantil.com> wrote:
> > > > > >> >>> > > >
> > > > > >> >>> > > > +1 to discussing PCollection annotations on a separate
> > > > > thread. It would be confusing to mix them up.
> > > > > >> >>> > > >
> > > > > >> >>> > > > -----------
> > > > > >> >>> > > >
> > > > > >> >>> > > > The question around conflicts is interesting, but
> > confusing
> > > > > to me. I don't think they exist in general. I keep coming back
> > around to
> > > > > that it depends on the annotation and the purpose of composites.
> > > > > Optionality saves us here too.
> > > > > >> >>> > > >
> > > > > >> >>> > > > Composites are nothing without their internal hypergraph
> > > > > structure. Eventually it comes down to executing the leaf nodes. The
> > > > > alternative to executing the leaf nodes is when the composite
> > represents a
> > > > > known transform and is replaced by the runner on submission time.
> > Lets
> > > > > look at each.
> > > > > >> >>> > > >
> > > > > >> >>> > > > If there's a property that only exists on the leaf
> > nodes,
> > > > > then it's not possible to bubble up that property to the composite
> > in all
> > > > > cases. Afterall, it's not necessarily the case that a privacy
> > preserving
> > > > > transform maintains the property for all output edges as not all
> > such edges
> > > > > pass through the preserving transform.
> > > > > >> >>> > > >
> > > > > >> >>> > > > On the other hand, with memory or gpu recommendations,
> > that
> > > > > might set a low bar on the composite level.
> > > > > >> >>> > > >
> > > > > >> >>> > > > But, composites (any transform really) can be runner
> > > > > replaced. I think it's fair to say that a runner replaced composite
> > is not
> > > > > beholden to the annotations of the original leaf transforms,
> > especially
> > > > > around physical requirements. The implementations are different. If
> > a known
> > > > > composite at the composite level requires GPUs and it's known
> > replacement
> > > > > doesn't, I'd posit that replacement was a choice the runner made
> > since it
> > > > > can't provision machines with GPUs.
> > > > > >> >>> > > >
> > > > > >> >>> > > > But, crucially around privacy annotated transforms, a
> > runner
> > > > > likely shouldn't replace a given subgraph that contains a privacy
> > > > > annotationed transform unless the replacements provide the same
> > level of
> > > > > privacy. However, such replacements only happens with well known
> > transforms
> > > > > with known properties anyway, so this can serve as an additional
> > layer of
> > > > > validation for a runner aware of the properties.
> > > > > >> >>> > > >
> > > > > >> >>> > > > This brings me back to my position: that the notion of
> > > > > conflicts is very annotation dependant, and that defining them as
> > optional
> > > > > is the most important feature to avoid issues. Conflicts don't exist
> > as an
> > > > > inherent property of annotations on ptransform of the hypergraph
> > structure.
> > > > > Am i wrong? No one has come up with an actual example of a conflict
> > as far
> > > > > as i understand the thread.
> > > > > >> >>> > > >
> > > > > >> >>> > > > Even Reuven's original question is more about whether
> > the
> > > > > runner is forced to look at leaf bodes rather than only looking at
> > the
> > > > > composite. Assuming the composite isn't replaced, the runner needs
> > to look
> > > > > at the leaf nodes regardless. And as discussed above there's no
> > generalized
> > > > > semantics that fit for all kinds of annotations, once replacements
> > are also
> > > > > considered.
> > > > > >> >>> > > >
> > > > > >> >>> > > > On Tue, Nov 17, 2020, 6:35 AM Ismaël Mejía <
> > iemejia@gmail.com>
> > > > > wrote:
> > > > > >> >>> > > >>
> > > > > >> >>> > > >> +1 Nice to see there is finally interest on this.
> > > > > Annotations for
> > > > > >> >>> > > >> PTransforms make total sense!
> > > > > >> >>> > > >>
> > > > > >> >>> > > >> The semantics should be strictly optional for runners
> > and
> > > > > correct
> > > > > >> >>> > > >> execution should not be affected by lack of support of
> > any
> > > > > annotation.
> > > > > >> >>> > > >> We should however keep the set of annotations small.
> > > > > >> >>> > > >>
> > > > > >> >>> > > >> > PTransforms are hierarchical - namely a PTransform
> > > > > contains other PTransforms, and so on. Is the runner expected to
> > resolve
> > > > > all annotations down to leaf nodes? What happens if that results in
> > > > > conflicting annotations?
> > > > > >> >>> > > >>
> > > > > >> >>> > > >> +1 to this question, This needs to be detailed.
> > > > > >> >>> > > >>
> > > > > >> >>> > > >> I am curious about how the end user APIs of this will
> > look
> > > > > maybe in
> > > > > >> >>> > > >> Java or Python, just an extra method to inject a Map
> > or via
> > > > > Java
> > > > > >> >>> > > >> annotations/Python decorators?
> > > > > >> >>> > > >>
> > > > > >> >>> > > >> We might prefer not to mix the concepts of annotations
> > and
> > > > > >> >>> > > >> environments because this will limit the scope of
> > > > > annotations.
> > > > > >> >>> > > >> Annotations are different from environments because
> > they
> > > > > serve a more
> > > > > >> >>> > > >> general idea: to express an intention and it is up to
> > the
> > > > > runner to
> > > > > >> >>> > > >> choose the strategy to accomplish this, for example in
> > the
> > > > > GPU
> > > > > >> >>> > > >> assignation case it could be to rewrite resource
> > allocation
> > > > > via
> > > > > >> >>> > > >> Environments but it could also just delegate this to a
> > > > > resource
> > > > > >> >>> > > >> manager which is what we could do for example for GPU
> > (or
> > > > > data
> > > > > >> >>> > > >> locality) cases on the Spark/Flink classic runners. If
> > we
> > > > > tie this to
> > > > > >> >>> > > >> environments we will leave classic runners out of the
> > loop
> > > > > for no
> > > > > >> >>> > > >> major reason and also not cover use cases not related
> > to
> > > > > resource
> > > > > >> >>> > > >> allocation.
> > > > > >> >>> > > >>
> > > > > >> >>> > > >> I do not understand the use case to justify PCollection
> > > > > annotations
> > > > > >> >>> > > >> but to not mix this thread with them, would you be
> > > > > interested to
> > > > > >> >>> > > >> elaborate more about them in a different thread Jan?
> > > > > >> >>> > > >>
> > > > > >> >>> > > >> On Tue, Nov 17, 2020 at 2:28 AM Robert Bradshaw <
> > > > > robertwb@google.com> wrote:
> > > > > >> >>> > > >> >
> > > > > >> >>> > > >> > I agree things like GPU, high-mem, etc. belong to the
> > > > > environment. If
> > > > > >> >>> > > >> > annotations are truly advisory, one can imagine
> > merging
> > > > > environments
> > > > > >> >>> > > >> > by taking the union of annotations and still
> > producing a
> > > > > correct
> > > > > >> >>> > > >> > pipeline. (This would mean that annotations would
> > have to
> > > > > be a
> > > > > >> >>> > > >> > multi-map...)
> > > > > >> >>> > > >> >
> > > > > >> >>> > > >> > On the other hand, this doesn't seem to handle the
> > case of
> > > > > privacy
> > > > > >> >>> > > >> > analysis, which could apply to composites without
> > applying
> > > > > to any
> > > > > >> >>> > > >> > individual component, and don't really make sense as
> > part
> > > > > of a
> > > > > >> >>> > > >> > fusion/execution story.
> > > > > >> >>> > > >> >
> > > > > >> >>> > > >> > On Mon, Nov 16, 2020 at 4:00 PM Robert Burke <
> > > > > robert@frantil.com> wrote:
> > > > > >> >>> > > >> > >
> > > > > >> >>> > > >> > > That's good historical context.
> > > > > >> >>> > > >> > >
> > > > > >> >>> > > >> > > But then we'd still need to codify the annotation
> > would
> > > > > need to be optional, and not affect correctness.
> > > > > >> >>> > > >> > >
> > > > > >> >>> > > >> > > Conflicts become easier to manage, (as
> > environments with
> > > > > conflicting annotations simply don't get merged, and stay as distinct
> > > > > environments) but are still notionally annotation dependant. Do most
> > > > > runners handle environments so individually though?
> > > > > >> >>> > > >> > >
> > > > > >> >>> > > >> > > Reuven's question is a good one though. For the Go
> > SDK,
> > > > > and the proposed implementation i saw, they only applied to leaf
> > nodes.
> > > > > This is an artifact of how the Go SDK handles composites. Nothing
> > stops it
> > > > > from being implemented on the composites Go has, but it didn't make
> > sense
> > > > > otherwise. AFAICT Composites are generally for organizational
> > convenience
> > > > > and not for functional aspects. Is this wrong? Afterall, does it
> > make sense
> > > > > for environments to be on leaves and composites either? It's the
> > same issue
> > > > > there.
> > > > > >> >>> > > >> > >
> > > > > >> >>> > > >> > >
> > > > > >> >>> > > >> > > On Mon, Nov 16, 2020, 3:45 PM Kenneth Knowles <
> > > > > kenn@apache.org> wrote:
> > > > > >> >>> > > >> > >>
> > > > > >> >>> > > >> > >> I am +1 to the proposal but believe it should be
> > moved
> > > > > to the Environment. I could be convinced otherwise, but would want to
> > > > > really understand the details.
> > > > > >> >>> > > >> > >>
> > > > > >> >>> > > >> > >> I think we haven't done a great job communicating
> > the
> > > > > purpose of the Environment proto. It was explicitly created for this
> > > > > purpose.
> > > > > >> >>> > > >> > >>
> > > > > >> >>> > > >> > >> 1. It tells the runner things it needs to know to
> > > > > interpret the DoFn (or other UDF). So these are the existing proto
> > fields
> > > > > like docker image (in the payload) and required artifacts that were
> > staged.
> > > > > >> >>> > > >> > >> 2. It is also the place for additional
> > requirements or
> > > > > hints like "high mem" or "GPU" etc.
> > > > > >> >>> > > >> > >>
> > > > > >> >>> > > >> > >> Every user function has an associated
> > environment, and
> > > > > environments exist only for the purpose of executing user functions.
> > In
> > > > > fact, Environment originated as inline requirements/attributes for a
> > user
> > > > > function proto message and was separated just to make the proto
> > smaller.
> > > > > >> >>> > > >> > >>
> > > > > >> >>> > > >> > >> A PTransform is an abstract concept for
> > organizing the
> > > > > graph, not an executable thing. So a hint/capability/requirement on a
> > > > > PTransform only really makes sense as a scoping mechanism for
> > applying a
> > > > > hint to a bunch of functions within a subgraph. This seems like a
> > user
> > > > > interface concern and the SDK should own propagating the hints. If
> > the hint
> > > > > truly applies to the whole PTransform and *not* the parts, then I am
> > > > > interested in learning about that.
> > > > > >> >>> > > >> > >>
> > > > > >> >>> > > >> > >> Kenn
> > > > > >> >>> > > >> > >>
> > > > > >> >>> > > >> > >> On Mon, Nov 16, 2020 at 10:54 AM Robert Burke <
> > > > > robert@frantil.com> wrote:
> > > > > >> >>> > > >> > >>>
> > > > > >> >>> > > >> > >>> That's a good question.
> > > > > >> >>> > > >> > >>>
> > > > > >> >>> > > >> > >>> I think the main difference is a matter of scope.
> > > > > Annotations would apply to a PTransform while an environment applies
> > to
> > > > > sets of transforms. A difference is the optional nature of the
> > annotations
> > > > > they don't affect correctness. Runners don't need to do anything
> > with them
> > > > > and still execute the pipeline correctly.
> > > > > >> >>> > > >> > >>>
> > > > > >> >>> > > >> > >>> Consider a privacy analysis on a pipeline graph.
> > An
> > > > > annotation indicating that a transform provides a certain level of
> > > > > anonymization can be used in an analysis to determine if the
> > downstream
> > > > > transforms are encountering raw data or not.
> > > > > >> >>> > > >> > >>>
> > > > > >> >>> > > >> > >>> From my understanding (which can be wrong)
> > > > > environments are rigid. Transforms in different environments can't be
> > > > > fused. "This is the python env", "this is the java env" can't be
> > merged
> > > > > together. It's not clear to me that we have defined when
> > environments are
> > > > > safely fuseable outside of equality. There's value in that
> > simplicity.
> > > > > >> >>> > > >> > >>>
> > > > > >> >>> > > >> > >>> AFIACT environment has less to do with the
> > machines a
> > > > > pipeline is executing on than it does about the kinds of SDK
> > pipelines it
> > > > > understands and can execute.
> > > > > >> >>> > > >> > >>>
> > > > > >> >>> > > >> > >>>
> > > > > >> >>> > > >> > >>>
> > > > > >> >>> > > >> > >>> On Mon, Nov 16, 2020, 10:36 AM Chad Dombrova <
> > > > > chadrik@gmail.com> wrote:
> > > > > >> >>> > > >> > >>>>>
> > > > > >> >>> > > >> > >>>>>
> > > > > >> >>> > > >> > >>>>> Another example of an optional annotation is
> > marking
> > > > > a transform to run on secure hardware, or to give hints to
> > > > > profiling/dynamic analysis tools.
> > > > > >> >>> > > >> > >>>>
> > > > > >> >>> > > >> > >>>>
> > > > > >> >>> > > >> > >>>> There seems to be a lot of overlap between this
> > idea
> > > > > and Environments.  Can you talk about how you feel they may be
> > different or
> > > > > related?  For example, I could see annotations as a way of tagging
> > > > > transforms with an Environment, or I could see Environments becoming
> > a
> > > > > specialized form of annotation.
> > > > > >> >>> > > >> > >>>>
> > > > > >> >>> > > >> > >>>> -chad
> > > > > >> >>> > > >> > >>>>
> > > > > >> >>> > >
> > > > >
> > > >
> > >
> >
>

Re: PTransform Annotations Proposal

Posted by Valentyn Tymofieiev <va...@google.com>.

I took a stab at implementing Python API for expressing resource hints[1]
in line with the discussion in this thread and added a few clarifications
to the design [2].

Looking forward to having resource hints in Beam.

Thanks!

[1] https://github.com/apache/beam/pull/14390
[2]
https://docs.google.com/document/d/1phExeGD1gdDI9M8LK4ZG57UGa7dswpB8Aj6jxWj4uQk/edit?usp=sharing




On Tue, Feb 2, 2021 at 9:01 AM Sunil Pedapudi <sk...@google.com> wrote:

> Thanks very much, Mirac. I added a few comments as well, and the proposal
> looks good to me. Thanks for all the discussion.
>
> On 2021/01/14 20:24:19, Robert Bradshaw <ro...@google.com> wrote:
> > Thanks. I added some comments on the proposal.
> >
> > On Wed, Jan 13, 2021 at 9:21 AM Mirac Vuslat Basaran <mi...@google.com>
> > wrote:
> >
> > > Wrote a design draft for resource-related annotations. Please have a
> look:
> > >
> https://docs.google.com/document/d/1phExeGD1gdDI9M8LK4ZG57UGa7dswpB8Aj6jxWj4uQk/edit?usp=sharing
> > >
> > > Cheers,
> > > Mirac
> > > On 2020/11/26 20:20:09, Mirac Vuslat Basaran <mi...@google.com>
> wrote:
> > > > Created a PR without unit tests at
> > > https://github.com/apache/beam/pull/13434. Please have a look.
> > > >
> > > > Thanks,
> > > > Mirac
> > > >
> > > >
> > > > On 2020/11/25 18:50:19, Robert Burke <ro...@frantil.com> wrote:
> > > > > Hmmm. Fair. I'm mostly concerned about the pathological case where
> we
> > > end
> > > > > up with a distinct Environment per transform, but there are likely
> > > > > practical cases where that's reasonable (High mem to GPU to TPU, to
> > > ARM....)
> > > > >
> > > > > On Wed, Nov 25, 2020, 10:42 AM Robert Bradshaw <
> robertwb@google.com>
> > > wrote:
> > > > >
> > > > > > I'd like to continue the discussion *and* see an implementation
> for
> > > > > > the part we've settled on. I was asking why not have "every
> distinct
> > > > > > physical concern means a distinct environment?"
> > > > > >
> > > > > > On Wed, Nov 25, 2020 at 10:38 AM Robert Burke <
> robert@frantil.com>
> > > wrote:
> > > > > > >
> > > > > > > Mostly because perfect is the enemy of good enough. We have a
> > > proposal,
> > > > > > we have clear boundaries for it. It's fine if the discussion
> > > continues, but
> > > > > > I see no evidence of concerns that should prevent starting an
> > > > > > implementation, because it seems we'll need both anyway.
> > > > > > >
> > > > > > > On Wed, Nov 25, 2020, 10:25 AM Robert Bradshaw <
> > > robertwb@google.com>
> > > > > > wrote:
> > > > > > >>
> > > > > > >> On Wed, Nov 25, 2020 at 10:15 AM Robert Burke <
> robert@frantil.com
> > > >
> > > > > > wrote:
> > > > > > >> >
> > > > > > >> > It sounds like we've come to the position that
> non-correctness
> > > > > > affecting Ptransform Annotations are valuable at both leaf and
> > > composite
> > > > > > levels, and don't remove the potential need for a similar
> feature on
> > > > > > Environments, to handle physical concerns equirements for worker
> > > processes
> > > > > > to have (such as Ram, CPU, or GPU requirements.)
> > > > > > >> >
> > > > > > >> > Kenn, it's not clear what part of the solution (an
> annotation
> > > field
> > > > > > on the Ptransform proto message) would need to change to satisfy
> > > your scope
> > > > > > concern, beyond documenting unambiguously that these may not be
> used
> > > for
> > > > > > physical concerns or things that affect correctness.
> > > > > > >>
> > > > > > >> I'll let Kenn answer as well, but from my point of view,
> > > explicitly
> > > > > > >> having somewhere better to put these things would help.
> > > > > > >>
> > > > > > >> > I'm also unclear your scope concern not matching, given the
> > > above.
> > > > > > Your first paragraph reads very supportive of logical
> annotations on
> > > > > > Ptransforms, and that matches 1-1 with the current proposed
> > > solution. Can
> > > > > > you clarify your concern?
> > > > > > >> >
> > > > > > >> > I don't wish to scope creep on the physical requirements
> issue
> > > at
> > > > > > this time. It seems we are agreed they should end up on
> > > environments, but
> > > > > > I'm not seeing proposals on the right way to execute them at this
> > > time.They
> > > > > > seem to be a fruitful topic of discussion, in particular
> > > > > > unifying/consolidating them for efficient use of resources. I
> don't
> > > think
> > > > > > we want to end up in a state where every distinct physical
> concern
> > > means a
> > > > > > distinct environment.
> > > > > > >>
> > > > > > >> Why not? Assuming, of course, that runners are free to merge
> > > > > > >> environments (merging those resource hints they understand
> and are
> > > > > > >> otherwise compatible, and discarding those they don't) for
> > > efficient
> > > > > > >> execution.
> > > > > > >>
> > > > > > >> > I for one am ready to see a PR.
> > > > > > >>
> > > > > > >> +1
> > > > > > >>
> > > > > > >> > On Mon, Nov 23, 2020, 6:02 PM Kenneth Knowles <
> kenn@apache.org>
> > > > > > wrote:
> > > > > > >> >>
> > > > > > >> >>
> > > > > > >> >>
> > > > > > >> >> On Mon, Nov 23, 2020 at 3:04 PM Robert Bradshaw <
> > > robertwb@google.com>
> > > > > > wrote:
> > > > > > >> >>>
> > > > > > >> >>> On Fri, Nov 20, 2020 at 11:08 AM Mirac Vuslat Basaran <
> > > > > > miracb@google.com> wrote:
> > > > > > >> >>> >
> > > > > > >> >>> > Thanks everyone so much for their input and for the
> > > insightful
> > > > > > discussion.
> > > > > > >> >>> >
> > > > > > >> >>> > Not being knowledgeable about Beam's internals, I have
> to
> > > say I
> > > > > > am a bit lost on the PTransform vs. environment discussion.
> > > > > > >> >>> >
> > > > > > >> >>> > I do agree with Burke's notion that merge rules are very
> > > > > > annotation dependent, I don't think we can find a
> one-size-fits-all
> > > > > > solution for that. So this might be actually be an argument in
> > > favour of
> > > > > > having annotations on PTransforms, since it avoids the conflation
> > > with
> > > > > > environments.
> > > > > > >> >>> >
> > > > > > >> >>> > Also in general, I feel that having annotations per
> single
> > > > > > transform (rather than composite) and on PTransforms could lead
> to a
> > > > > > simpler design.
> > > > > > >> >>>
> > > > > > >> >>> If we want to use these for privacy, I don't see how
> > > attaching them
> > > > > > to
> > > > > > >> >>> leaf transforms alone could work. (Even CombinePerKey is a
> > > > > > composite.)
> > > > > > >> >>>
> > > > > > >> >>> > Seeing as there are valuable arguments in favour of both
> > > > > > (PTransform and environments) with no clear(?) "best solution", I
> > > would
> > > > > > propose moving forward with the initial (PTransform) design to
> ship
> > > the
> > > > > > feature and unblock teams asking for it. If it turns out that
> there
> > > was
> > > > > > indeed a need to have annotations in environments, we could
> always
> > > refactor
> > > > > > it.
> > > > > > >> >>>
> > > > > > >> >>> I have yet to see any arguments that resource-level hints,
> > > such as
> > > > > > >> >>> memory or GPU, don't better belong on the environment. But
> > > moving
> > > > > > >> >>> forward on PTransform-level ones for logical statements
> like
> > > privacy
> > > > > > >> >>> declarations makes sense.
> > > > > > >> >>
> > > > > > >> >>
> > > > > > >> >> Exactly this. Properties of transforms make sense. The
> > > properties
> > > > > > may hold only of the whole subgraph. Even something as simple as
> > > "preserves
> > > > > > keys". This is analogous (but converse) to requirements like
> > > "requires
> > > > > > sorted input" which were explicitly excluded from the scope,
> which
> > > was
> > > > > > about hardware environment for execution.
> > > > > > >> >>
> > > > > > >> >> The proposed scope and the proposed solution do not match
> and
> > > need
> > > > > > to be reconciled.
> > > > > > >> >>
> > > > > > >> >> Kenn
> > > > > > >> >>
> > > > > > >> >>> > On 2020/11/17 19:07:22, Robert Bradshaw <
> > > robertwb@google.com>
> > > > > > wrote:
> > > > > > >> >>> > > So far we have two distinct usecases for annotations:
> > > resource
> > > > > > hints
> > > > > > >> >>> > > and privacy directives, and I've been trying to figure
> > > out how
> > > > > > to
> > > > > > >> >>> > > reconcile them, but they seem to have very different
> > > > > > characteristics.
> > > > > > >> >>> > > (It would be nice to come up with other uses as well
> to
> > > see if
> > > > > > we're
> > > > > > >> >>> > > really coming up with a generally useful mode--I think
> > > display
> > > > > > data
> > > > > > >> >>> > > could fit into this as a new kind of annotation rather
> > > than
> > > > > > being a
> > > > > > >> >>> > > top-level property, and it could make sense on both
> leaf
> > > and
> > > > > > composite
> > > > > > >> >>> > > transforms.)
> > > > > > >> >>> > >
> > > > > > >> >>> > > To me, resource hints like GPU are inextricably tied
> to
> > > the
> > > > > > >> >>> > > environment. A transform tagged with GPU should
> reference
> > > a Fn
> > > > > > that
> > > > > > >> >>> > > invokes GPU-accelerated code that lives in a
> particular
> > > > > > environment.
> > > > > > >> >>> > > Something like high-mem is a bit squishier. Some DoFns
> > > take a
> > > > > > lot of
> > > > > > >> >>> > > memory, but on the other hand one could imagine
> labeling a
> > > > > > CoGBK as
> > > > > > >> >>> > > high-mem due to knowing that, in this particular
> usage,
> > > there
> > > > > > will be
> > > > > > >> >>> > > lots of values with the same key. Ideally runners
> would be
> > > > > > intelligent
> > > > > > >> >>> > > enough to automatically learn memory usage, but even
> in
> > > this
> > > > > > case it
> > > > > > >> >>> > > may be a good hint to try and learn the requirements
> for
> > > DoFn A
> > > > > > and
> > > > > > >> >>> > > DoFn B separately (which is difficult if they are
> always
> > > > > > colocated,
> > > > > > >> >>> > > but valuable if, e.g. A takes a huge amount of memory
> and
> > > B
> > > > > > takes a
> > > > > > >> >>> > > huge amount of wall time).
> > > > > > >> >>> > >
> > > > > > >> >>> > > Note that tying things to the environment does not
> > > preclude
> > > > > > using them
> > > > > > >> >>> > > in non-portable runners as they'll still have an
> SDK-level
> > > > > > >> >>> > > representation (though I don't think we should have an
> > > explicit
> > > > > > goal
> > > > > > >> >>> > > of feature parity for non-portable runners, e.g.
> > > multi-language
> > > > > > isn't
> > > > > > >> >>> > > happening, and hope that non-portable runners go away
> soon
> > > > > > anyway).
> > > > > > >> >>> > >
> > > > > > >> >>> > > Now let's consider privacy annotations. To make things
> > > very
> > > > > > concrete,
> > > > > > >> >>> > > imagine a transform AverageSpendPerZipCode which
> takes as
> > > input
> > > > > > (user,
> > > > > > >> >>> > > zip, spend), all users unique, and returns (zip,
> > > avg(spend)). In
> > > > > > >> >>> > > Python, this is
> GroupBy('zip').aggregate_field('spend',
> > > > > > >> >>> > > MeanCombineFn()). This is not very privacy preserving
> to
> > > those
> > > > > > users
> > > > > > >> >>> > > who are the only (or one of a few) in a zip code. So
> we
> > > could
> > > > > > define a
> > > > > > >> >>> > > transform PrivacyPreservingAverageSpendPerZipCode as
> > > > > > >> >>> > >
> > > > > > >> >>> > > @ptransform_fn
> > > > > > >> >>> > > def
> > > PrivacyPreservingAverageSpendPerZipCode(spend_per_user,
> > > > > > threshold)
> > > > > > >> >>> > >     counts_per_zip = spend_per_user |
> > > > > > >> >>> > > GroupBy('zip').aggregate_field('user',
> CountCombineFn())
> > > > > > >> >>> > >     spend_per_zip = spend_per_user |
> > > > > > >> >>> > > GroupBy('zip').aggregate_field('spend',
> MeanCombineFn())
> > > > > > >> >>> > >     filtered = spend_per_zip | beam.Filter(
> > > > > > >> >>> > >         lambda x, counts: counts[x.zip] > threshold,
> > > > > > >> >>> > > counts=AsMap(counts_per_zip))
> > > > > > >> >>> > >     return filtered
> > > > > > >> >>> > >
> > > > > > >> >>> > > We now have a composite that has privacy preserving
> > > properties
> > > > > > (i.e.
> > > > > > >> >>> > > the input may be quite sensitive, but the output is
> not,
> > > > > > depending on
> > > > > > >> >>> > > the value of threshold). What is interesting here is
> that
> > > it is
> > > > > > only
> > > > > > >> >>> > > the composite that has this property--no individual
> > > > > > sub-transform is
> > > > > > >> >>> > > itself privacy preserving. Furthermore, an optimizer
> may
> > > notice
> > > > > > we're
> > > > > > >> >>> > > doing aggregation on the same key twice and rewrite
> this
> > > using
> > > > > > >> >>> > > (logically)
> > > > > > >> >>> > >
> > > > > > >> >>> > >     GroupBy('zip').aggregate_field('user',
> > > > > > >> >>> > > CountCombineFn()).aggregate_field('spend',
> > > MeanCombineFn())
> > > > > > >> >>> > >
> > > > > > >> >>> > > and then applying the filter, which is semantically
> > > equivalent
> > > > > > and
> > > > > > >> >>> > > satisfies the privacy annotations (and notably that
> does
> > > not
> > > > > > even
> > > > > > >> >>> > > require the optimizer to interpret the annotations,
> just
> > > pass
> > > > > > them
> > > > > > >> >>> > > on). To me, this implies that these annotations
> belong on
> > > the
> > > > > > >> >>> > > composites, and *not* on the leaf nodes (where they
> would
> > > be
> > > > > > >> >>> > > incorrect).
> > > > > > >> >>> > >
> > > > > > >> >>> > > I'll leave aside most questions of API until we figure
> > > out the
> > > > > > model
> > > > > > >> >>> > > semantics, but wanted to throw one possible idea out
> > > (though I
> > > > > > am
> > > > > > >> >>> > > ambivalent about it). Instead of attaching things to
> > > > > > transforms, we
> > > > > > >> >>> > > can just wrap transforms in composites that have no
> role
> > > other
> > > > > > than
> > > > > > >> >>> > > declaring information about their contents. E.g. we
> could
> > > have a
> > > > > > >> >>> > > composite transform whose payload is simply an
> assertion
> > > of the
> > > > > > >> >>> > > privacy (or resource?) properties of its inner
> structure.
> > > This
> > > > > > would
> > > > > > >> >>> > > be just as expressive as adding new properties to
> > > transforms
> > > > > > >> >>> > > themselves (but would add an extra level of nesting,
> and
> > > make
> > > > > > >> >>> > > respecting the precice nesting more important).
> > > > > > >> >>> > >
> > > > > > >> >>> > > On Tue, Nov 17, 2020 at 8:12 AM Robert Burke <
> > > > > > robert@frantil.com> wrote:
> > > > > > >> >>> > > >
> > > > > > >> >>> > > > +1 to discussing PCollection annotations on a
> separate
> > > > > > thread. It would be confusing to mix them up.
> > > > > > >> >>> > > >
> > > > > > >> >>> > > > -----------
> > > > > > >> >>> > > >
> > > > > > >> >>> > > > The question around conflicts is interesting, but
> > > confusing
> > > > > > to me. I don't think they exist in general. I keep coming back
> > > around to
> > > > > > that it depends on the annotation and the purpose of composites.
> > > > > > Optionality saves us here too.
> > > > > > >> >>> > > >
> > > > > > >> >>> > > > Composites are nothing without their internal
> hypergraph
> > > > > > structure. Eventually it comes down to executing the leaf nodes.
> The
> > > > > > alternative to executing the leaf nodes is when the composite
> > > represents a
> > > > > > known transform and is replaced by the runner on submission time.
> > > Lets
> > > > > > look at each.
> > > > > > >> >>> > > >
> > > > > > >> >>> > > > If there's a property that only exists on the leaf
> > > nodes,
> > > > > > then it's not possible to bubble up that property to the
> composite
> > > in all
> > > > > > cases. Afterall, it's not necessarily the case that a privacy
> > > preserving
> > > > > > transform maintains the property for all output edges as not all
> > > such edges
> > > > > > pass through the preserving transform.
> > > > > > >> >>> > > >
> > > > > > >> >>> > > > On the other hand, with memory or gpu
> recommendations,
> > > that
> > > > > > might set a low bar on the composite level.
> > > > > > >> >>> > > >
> > > > > > >> >>> > > > But, composites (any transform really) can be runner
> > > > > > replaced. I think it's fair to say that a runner replaced
> composite
> > > is not
> > > > > > beholden to the annotations of the original leaf transforms,
> > > especially
> > > > > > around physical requirements. The implementations are different.
> If
> > > a known
> > > > > > composite at the composite level requires GPUs and it's known
> > > replacement
> > > > > > doesn't, I'd posit that replacement was a choice the runner made
> > > since it
> > > > > > can't provision machines with GPUs.
> > > > > > >> >>> > > >
> > > > > > >> >>> > > > But, crucially around privacy annotated transforms,
> a
> > > runner
> > > > > > likely shouldn't replace a given subgraph that contains a privacy
> > > > > > annotationed transform unless the replacements provide the same
> > > level of
> > > > > > privacy. However, such replacements only happens with well known
> > > transforms
> > > > > > with known properties anyway, so this can serve as an additional
> > > layer of
> > > > > > validation for a runner aware of the properties.
> > > > > > >> >>> > > >
> > > > > > >> >>> > > > This brings me back to my position: that the notion
> of
> > > > > > conflicts is very annotation dependant, and that defining them as
> > > optional
> > > > > > is the most important feature to avoid issues. Conflicts don't
> exist
> > > as an
> > > > > > inherent property of annotations on ptransform of the hypergraph
> > > structure.
> > > > > > Am i wrong? No one has come up with an actual example of a
> conflict
> > > as far
> > > > > > as i understand the thread.
> > > > > > >> >>> > > >
> > > > > > >> >>> > > > Even Reuven's original question is more about
> whether
> > > the
> > > > > > runner is forced to look at leaf bodes rather than only looking
> at
> > > the
> > > > > > composite. Assuming the composite isn't replaced, the runner
> needs
> > > to look
> > > > > > at the leaf nodes regardless. And as discussed above there's no
> > > generalized
> > > > > > semantics that fit for all kinds of annotations, once
> replacements
> > > are also
> > > > > > considered.
> > > > > > >> >>> > > >
> > > > > > >> >>> > > > On Tue, Nov 17, 2020, 6:35 AM Ismaël Mejía <
> > > iemejia@gmail.com>
> > > > > > wrote:
> > > > > > >> >>> > > >>
> > > > > > >> >>> > > >> +1 Nice to see there is finally interest on this.
> > > > > > Annotations for
> > > > > > >> >>> > > >> PTransforms make total sense!
> > > > > > >> >>> > > >>
> > > > > > >> >>> > > >> The semantics should be strictly optional for
> runners
> > > and
> > > > > > correct
> > > > > > >> >>> > > >> execution should not be affected by lack of
> support of
> > > any
> > > > > > annotation.
> > > > > > >> >>> > > >> We should however keep the set of annotations
> small.
> > > > > > >> >>> > > >>
> > > > > > >> >>> > > >> > PTransforms are hierarchical - namely a
> PTransform
> > > > > > contains other PTransforms, and so on. Is the runner expected to
> > > resolve
> > > > > > all annotations down to leaf nodes? What happens if that results
> in
> > > > > > conflicting annotations?
> > > > > > >> >>> > > >>
> > > > > > >> >>> > > >> +1 to this question, This needs to be detailed.
> > > > > > >> >>> > > >>
> > > > > > >> >>> > > >> I am curious about how the end user APIs of this
> will
> > > look
> > > > > > maybe in
> > > > > > >> >>> > > >> Java or Python, just an extra method to inject a
> Map
> > > or via
> > > > > > Java
> > > > > > >> >>> > > >> annotations/Python decorators?
> > > > > > >> >>> > > >>
> > > > > > >> >>> > > >> We might prefer not to mix the concepts of
> annotations
> > > and
> > > > > > >> >>> > > >> environments because this will limit the scope of
> > > > > > annotations.
> > > > > > >> >>> > > >> Annotations are different from environments because
> > > they
> > > > > > serve a more
> > > > > > >> >>> > > >> general idea: to express an intention and it is up
> to
> > > the
> > > > > > runner to
> > > > > > >> >>> > > >> choose the strategy to accomplish this, for
> example in
> > > the
> > > > > > GPU
> > > > > > >> >>> > > >> assignation case it could be to rewrite resource
> > > allocation
> > > > > > via
> > > > > > >> >>> > > >> Environments but it could also just delegate this
> to a
> > > > > > resource
> > > > > > >> >>> > > >> manager which is what we could do for example for
> GPU
> > > (or
> > > > > > data
> > > > > > >> >>> > > >> locality) cases on the Spark/Flink classic
> runners. If
> > > we
> > > > > > tie this to
> > > > > > >> >>> > > >> environments we will leave classic runners out of
> the
> > > loop
> > > > > > for no
> > > > > > >> >>> > > >> major reason and also not cover use cases not
> related
> > > to
> > > > > > resource
> > > > > > >> >>> > > >> allocation.
> > > > > > >> >>> > > >>
> > > > > > >> >>> > > >> I do not understand the use case to justify
> PCollection
> > > > > > annotations
> > > > > > >> >>> > > >> but to not mix this thread with them, would you be
> > > > > > interested to
> > > > > > >> >>> > > >> elaborate more about them in a different thread
> Jan?
> > > > > > >> >>> > > >>
> > > > > > >> >>> > > >> On Tue, Nov 17, 2020 at 2:28 AM Robert Bradshaw <
> > > > > > robertwb@google.com> wrote:
> > > > > > >> >>> > > >> >
> > > > > > >> >>> > > >> > I agree things like GPU, high-mem, etc. belong
> to the
> > > > > > environment. If
> > > > > > >> >>> > > >> > annotations are truly advisory, one can imagine
> > > merging
> > > > > > environments
> > > > > > >> >>> > > >> > by taking the union of annotations and still
> > > producing a
> > > > > > correct
> > > > > > >> >>> > > >> > pipeline. (This would mean that annotations would
> > > have to
> > > > > > be a
> > > > > > >> >>> > > >> > multi-map...)
> > > > > > >> >>> > > >> >
> > > > > > >> >>> > > >> > On the other hand, this doesn't seem to handle
> the
> > > case of
> > > > > > privacy
> > > > > > >> >>> > > >> > analysis, which could apply to composites without
> > > applying
> > > > > > to any
> > > > > > >> >>> > > >> > individual component, and don't really make
> sense as
> > > part
> > > > > > of a
> > > > > > >> >>> > > >> > fusion/execution story.
> > > > > > >> >>> > > >> >
> > > > > > >> >>> > > >> > On Mon, Nov 16, 2020 at 4:00 PM Robert Burke <
> > > > > > robert@frantil.com> wrote:
> > > > > > >> >>> > > >> > >
> > > > > > >> >>> > > >> > > That's good historical context.
> > > > > > >> >>> > > >> > >
> > > > > > >> >>> > > >> > > But then we'd still need to codify the
> annotation
> > > would
> > > > > > need to be optional, and not affect correctness.
> > > > > > >> >>> > > >> > >
> > > > > > >> >>> > > >> > > Conflicts become easier to manage, (as
> > > environments with
> > > > > > conflicting annotations simply don't get merged, and stay as
> distinct
> > > > > > environments) but are still notionally annotation dependant. Do
> most
> > > > > > runners handle environments so individually though?
> > > > > > >> >>> > > >> > >
> > > > > > >> >>> > > >> > > Reuven's question is a good one though. For
> the Go
> > > SDK,
> > > > > > and the proposed implementation i saw, they only applied to leaf
> > > nodes.
> > > > > > This is an artifact of how the Go SDK handles composites. Nothing
> > > stops it
> > > > > > from being implemented on the composites Go has, but it didn't
> make
> > > sense
> > > > > > otherwise. AFAICT Composites are generally for organizational
> > > convenience
> > > > > > and not for functional aspects. Is this wrong? Afterall, does it
> > > make sense
> > > > > > for environments to be on leaves and composites either? It's the
> > > same issue
> > > > > > there.
> > > > > > >> >>> > > >> > >
> > > > > > >> >>> > > >> > >
> > > > > > >> >>> > > >> > > On Mon, Nov 16, 2020, 3:45 PM Kenneth Knowles <
> > > > > > kenn@apache.org> wrote:
> > > > > > >> >>> > > >> > >>
> > > > > > >> >>> > > >> > >> I am +1 to the proposal but believe it should
> be
> > > moved
> > > > > > to the Environment. I could be convinced otherwise, but would
> want to
> > > > > > really understand the details.
> > > > > > >> >>> > > >> > >>
> > > > > > >> >>> > > >> > >> I think we haven't done a great job
> communicating
> > > the
> > > > > > purpose of the Environment proto. It was explicitly created for
> this
> > > > > > purpose.
> > > > > > >> >>> > > >> > >>
> > > > > > >> >>> > > >> > >> 1. It tells the runner things it needs to
> know to
> > > > > > interpret the DoFn (or other UDF). So these are the existing
> proto
> > > fields
> > > > > > like docker image (in the payload) and required artifacts that
> were
> > > staged.
> > > > > > >> >>> > > >> > >> 2. It is also the place for additional
> > > requirements or
> > > > > > hints like "high mem" or "GPU" etc.
> > > > > > >> >>> > > >> > >>
> > > > > > >> >>> > > >> > >> Every user function has an associated
> > > environment, and
> > > > > > environments exist only for the purpose of executing user
> functions.
> > > In
> > > > > > fact, Environment originated as inline requirements/attributes
> for a
> > > user
> > > > > > function proto message and was separated just to make the proto
> > > smaller.
> > > > > > >> >>> > > >> > >>
> > > > > > >> >>> > > >> > >> A PTransform is an abstract concept for
> > > organizing the
> > > > > > graph, not an executable thing. So a hint/capability/requirement
> on a
> > > > > > PTransform only really makes sense as a scoping mechanism for
> > > applying a
> > > > > > hint to a bunch of functions within a subgraph. This seems like a
> > > user
> > > > > > interface concern and the SDK should own propagating the hints.
> If
> > > the hint
> > > > > > truly applies to the whole PTransform and *not* the parts, then
> I am
> > > > > > interested in learning about that.
> > > > > > >> >>> > > >> > >>
> > > > > > >> >>> > > >> > >> Kenn
> > > > > > >> >>> > > >> > >>
> > > > > > >> >>> > > >> > >> On Mon, Nov 16, 2020 at 10:54 AM Robert Burke
> <
> > > > > > robert@frantil.com> wrote:
> > > > > > >> >>> > > >> > >>>
> > > > > > >> >>> > > >> > >>> That's a good question.
> > > > > > >> >>> > > >> > >>>
> > > > > > >> >>> > > >> > >>> I think the main difference is a matter of
> scope.
> > > > > > Annotations would apply to a PTransform while an environment
> applies
> > > to
> > > > > > sets of transforms. A difference is the optional nature of the
> > > annotations
> > > > > > they don't affect correctness. Runners don't need to do anything
> > > with them
> > > > > > and still execute the pipeline correctly.
> > > > > > >> >>> > > >> > >>>
> > > > > > >> >>> > > >> > >>> Consider a privacy analysis on a pipeline
> graph.
> > > An
> > > > > > annotation indicating that a transform provides a certain level
> of
> > > > > > anonymization can be used in an analysis to determine if the
> > > downstream
> > > > > > transforms are encountering raw data or not.
> > > > > > >> >>> > > >> > >>>
> > > > > > >> >>> > > >> > >>> From my understanding (which can be wrong)
> > > > > > environments are rigid. Transforms in different environments
> can't be
> > > > > > fused. "This is the python env", "this is the java env" can't be
> > > merged
> > > > > > together. It's not clear to me that we have defined when
> > > environments are
> > > > > > safely fuseable outside of equality. There's value in that
> > > simplicity.
> > > > > > >> >>> > > >> > >>>
> > > > > > >> >>> > > >> > >>> AFIACT environment has less to do with the
> > > machines a
> > > > > > pipeline is executing on than it does about the kinds of SDK
> > > pipelines it
> > > > > > understands and can execute.
> > > > > > >> >>> > > >> > >>>
> > > > > > >> >>> > > >> > >>>
> > > > > > >> >>> > > >> > >>>
> > > > > > >> >>> > > >> > >>> On Mon, Nov 16, 2020, 10:36 AM Chad Dombrova
> <
> > > > > > chadrik@gmail.com> wrote:
> > > > > > >> >>> > > >> > >>>>>
> > > > > > >> >>> > > >> > >>>>>
> > > > > > >> >>> > > >> > >>>>> Another example of an optional annotation
> is
> > > marking
> > > > > > a transform to run on secure hardware, or to give hints to
> > > > > > profiling/dynamic analysis tools.
> > > > > > >> >>> > > >> > >>>>
> > > > > > >> >>> > > >> > >>>>
> > > > > > >> >>> > > >> > >>>> There seems to be a lot of overlap between
> this
> > > idea
> > > > > > and Environments.  Can you talk about how you feel they may be
> > > different or
> > > > > > related?  For example, I could see annotations as a way of
> tagging
> > > > > > transforms with an Environment, or I could see Environments
> becoming
> > > a
> > > > > > specialized form of annotation.
> > > > > > >> >>> > > >> > >>>>
> > > > > > >> >>> > > >> > >>>> -chad
> > > > > > >> >>> > > >> > >>>>
> > > > > > >> >>> > >
> > > > > >
> > > > >
> > > >
> > >
> >
>