Posted to user@beam.apache.org by Kenneth Knowles <ke...@apache.org> on 2021/01/05 14:00:00 UTC
Re: Combine with multiple outputs case Sample and the rest
Perhaps something based on stateful DoFn so there is a simple decision
point at which each element is either sampled or not so it can be output to
one PCollection or the other. Without doing a little research, I don't
recall if this is doable in the way you need.
Kenn
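
A minimal sketch of the decision-point idea above, in plain Python rather than Beam API code. It models the state a stateful DoFn would hold as a reservoir of size n plus a count of elements seen. Each arriving element either enters the reservoir or goes straight to the "rest" output; an element evicted from the reservoir is emitted to "rest" at that moment, and the reservoir is flushed as the sample at the end. The helper name split_sample and the tags "sample" and "rest" are hypothetical, and the final flush would presumably map to an end-of-window timer in Beam, which is an assumption.

```python
import random


def split_sample(elements, n, rng=None):
    """Yield ("rest", x) or ("sample", x) exactly once per input element.

    Hypothetical helper: reservoir sampling (Algorithm R) in which
    evicted elements are routed to a second output instead of dropped.
    """
    rng = rng or random.Random()
    reservoir = []  # state: current sample candidates
    seen = 0        # state: number of elements processed so far
    for x in elements:
        seen += 1
        if len(reservoir) < n:
            # The first n elements always enter the reservoir.
            reservoir.append(x)
            continue
        j = rng.randrange(seen)  # uniform integer in [0, seen)
        if j < n:
            # x replaces a reservoir slot; the old occupant is now "rest".
            evicted, reservoir[j] = reservoir[j], x
            yield ("rest", evicted)
        else:
            yield ("rest", x)
    # End of input: whatever is still in the reservoir is the sample.
    for x in reservoir:
        yield ("sample", x)
```

Memory stays proportional to n, since only the reservoir is held. Beam state is scoped per key and window, so a single global sample would likely mean routing everything through one key, which serializes the processing.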
On Wed, Dec 23, 2020 at 3:12 PM Ismaël Mejía <ie...@gmail.com> wrote:
> Thanks for the answer Robert. Producing a combiner with two lists as
> outputs was one idea I was considering too but I was afraid of
> OutOfMemory issues. I had not thought much about the consequences on
> combining state, thanks for pointing that out. For the particular sampling
> use case it might not be an issue, or am I missing something?
>
> I am still curious whether, for sampling, there could be another approach
> that produces the same result (uniform sample + the rest) but without
> the issues of combining.
>
> On Mon, Dec 21, 2020 at 7:23 PM Robert Bradshaw <ro...@google.com>
> wrote:
> >
> > There are two ways to emit multiple outputs: either to multiple distinct
> > PCollections (e.g. withOutputTags) or multiple (including 0) outputs to a
> > single PCollection (the difference between Map and FlatMap). In full
> > generality, one can always have a CombineFn that outputs lists (say <tag,
> > result>*) followed by a DoFn that emits to multiple places based on this
> > result.
> >
> > One other con of emitting multiple values from a CombineFn is that
> > CombineFns are used in other contexts as well, e.g. combining state, and
> > trying to make sense of a multi-outputting CombineFn in that context is
> > trickier.
> >
> > Note that for Sample in particular, it works as a CombineFn because we
> > throw most of the data away. If we kept most of the data, it likely
> > wouldn't fit into one machine to do the final sampling. The idea of using a
> > side input to filter after the fact should work well (unless there are
> > duplicate elements, in which case you'd have to uniquify them somehow to
> > filter out only the "right" copies).
> >
> > - Robert
> >
> >
> >
> > On Fri, Dec 18, 2020 at 8:20 AM Ismaël Mejía <ie...@gmail.com> wrote:
> >>
> >> I had a question today from one of our users about Beam’s Sample
> >> transform (a Combine with an internal top-like function to produce a
> >> uniform sample of size n of a PCollection). They also wanted to obtain
> >> the rest of the PCollection as an output (the non-sampled elements).
> >>
> >> My suggestion was to use the sample (since it was small) as a side
> >> input and then reprocess the collection to filter its elements;
> >> however, I wonder if this is the ‘best’ solution.
> >>
> >> I was also wondering: if Combine is essentially GbK + ParDo, why don’t we
> >> have a Combine function with multiple outputs (maybe an evolution of
> >> CombineWithContext)? I know this sounds weird, and I have probably not
> >> thought enough about the issues or the performance of the translation, but
> >> I wanted to see what others think. Does this make sense? Do you see any
> >> pros/cons, or have other ideas?
> >>
> >> Thanks,
> >> Ismaël
>
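
The side-input filter discussed above, together with the duplicate-element caveat Robert raised, might look like the following plain-Python sketch (not Beam API code). It assumes the sample is small enough to hold in memory as a multiset and that elements are hashable; the helper name remove_sample is hypothetical. Counting the sampled copies means only as many copies are dropped as were actually sampled.

```python
from collections import Counter


def remove_sample(elements, sample):
    """Return the elements left after removing the sample as a multiset."""
    remaining = Counter(sample)  # copies of each value still to drop
    rest = []
    for x in elements:
        if remaining[x] > 0:
            # This copy is accounted for by the sample; skip it.
            remaining[x] -= 1
        else:
            rest.append(x)
    return rest
```

This works only because a single loop owns the counter. In a parallel pipeline the counter could not be shared across workers, which is presumably why the elements would need to be made unique first (for example by attaching an id) before filtering against the side input.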
Re: Combine with multiple outputs case Sample and the rest
Posted by Etienne Chauchot <ec...@apache.org>.
Hi all,
Regarding leveraging the ParDo part of Combine (Combine <=> GBK + ParDo)
to have multiple outputs, please note that most of the time runners
translate Combine into a native (destination-tech) Combine and
not a GBK + ParDo.
Regarding using the stateful DoFn, I agree with Kenn, with the small
caveat that stateful DoFn is not supported in streaming mode with
the Spark runner.
But I guess, Ismaël, that the use case is batch mode.
Best
Etienne
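
For reference, Robert's earlier suggestion in this thread (a CombineFn that outputs a list of <tag, result> pairs, followed by a DoFn that routes on the tag) can be modelled in plain Python as below. This is a sketch, not Beam API code: the class is a plain object whose four method names mirror Beam's Python CombineFn interface, and TaggedMinMax, route and the "min"/"max" tags are hypothetical, chosen only to keep the accumulator small.

```python
class TaggedMinMax:
    """Combiner whose single output is a list of (tag, value) pairs."""

    def create_accumulator(self):
        return (None, None)  # (smallest seen, largest seen)

    def add_input(self, acc, x):
        lo, hi = acc
        return (x if lo is None else min(lo, x),
                x if hi is None else max(hi, x))

    def merge_accumulators(self, accs):
        merged = self.create_accumulator()
        for lo, hi in accs:
            for v in (lo, hi):
                if v is not None:
                    merged = self.add_input(merged, v)
        return merged

    def extract_output(self, acc):
        lo, hi = acc
        return [] if lo is None else [("min", lo), ("max", hi)]


def route(tagged):
    """Stand-in for the follow-up DoFn: group values by their tag."""
    outputs = {}
    for tag, value in tagged:
        outputs.setdefault(tag, []).append(value)
    return outputs
```

Because the combiner still produces one value (a list), this shape should not depend on Combine being expanded to GBK + ParDo, so it ought to remain compatible with a runner's native Combine translation. The memory concern raised earlier still applies if the tagged list grows with the input.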