Posted to user@beam.apache.org by Kenneth Knowles <ke...@apache.org> on 2021/01/05 14:00:00 UTC

Re: Combine with multiple outputs case Sample and the rest

Perhaps something based on a stateful DoFn, so that there is a simple
decision point at which each element is either sampled or not and can be
output to one PCollection or the other. Without doing a little research, I
don't recall whether this is doable in the way you need.
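
A minimal sketch of such a per-element decision point, in plain Python
rather than Beam code. It uses selection sampling, which is a swapped-in
technique and not what Sample does internally. It assumes the total number
of elements is known up front (for example from an earlier count) and that
all elements pass through one piece of state sequentially. The name
split_sample and its parameters are hypothetical.

```python
import random


def split_sample(elements, n, total, rng=None):
    """Route each element to `sample` or `rest` with one decision per element.

    Selection sampling: an element is taken with probability
    (still needed) / (still to come), so exactly min(n, total) elements
    are sampled when `total` matches the real input size.
    """
    rng = rng or random.Random()
    sample, rest = [], []
    seen = 0  # the only state carried between elements, besides len(sample)
    for element in elements:
        still_needed = n - len(sample)
        still_to_come = total - seen
        # The decision point: sampled or not, never revisited later.
        if rng.random() * still_to_come < still_needed:
            sample.append(element)
        else:
            rest.append(element)
        seen += 1
    return sample, rest


sample, rest = split_sample(range(100), n=10, total=100, rng=random.Random(0))
```

In a real stateful DoFn the two counters would live in per-key state and the
two lists would be the two tagged outputs; funnelling everything through a
single key limits parallelism.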

Kenn

On Wed, Dec 23, 2020 at 3:12 PM Ismaël Mejía <ie...@gmail.com> wrote:

> Thanks for the answer, Robert. Producing a combiner with two lists as
> outputs was one idea I was considering too, but I was afraid of
> OutOfMemory issues. I had not thought much about the consequences for
> combining state; thanks for pointing that out. For the particular sampling
> use case it might not be an issue, or am I missing something?
>
> I am still curious whether, for sampling, there could be another approach
> that produces the same result (uniform sample + the rest) but without
> the issues of combining.
>
> On Mon, Dec 21, 2020 at 7:23 PM Robert Bradshaw <ro...@google.com>
> wrote:
> >
> > There are two ways to emit multiple outputs: either to multiple distinct
> > PCollections (e.g. withOutputTags) or multiple (including 0) outputs to a
> > single PCollection (the difference between Map and FlatMap). In full
> > generality, one can always have a CombineFn that outputs lists (say <tag,
> > result>*) followed by a DoFn that emits to multiple places based on this
> > result.
> >
> > One other con of emitting multiple values from a CombineFn is that
> > CombineFns are used in other contexts as well, e.g. combining state, and
> > trying to make sense of a multi-outputting CombineFn in that context is
> > trickier.
> >
> > Note that for Sample in particular, it works as a CombineFn because we
> > throw most of the data away. If we kept most of the data, it likely
> > wouldn't fit onto one machine to do the final sampling. The idea of using a
> > side input to filter after the fact should work well (unless there are
> > duplicate elements, in which case you'd have to uniquify them somehow to
> > filter out only the "right" copies).
> >
> > - Robert
> >
> >
> >
> > On Fri, Dec 18, 2020 at 8:20 AM Ismaël Mejía <ie...@gmail.com> wrote:
> >>
> >> I had a question today from one of our users about Beam’s Sample
> >> transform (a Combine with an internal top-like function to produce a
> >> uniform sample of size n of a PCollection). They also wanted to obtain
> >> the rest of the PCollection as an output (the non-sampled elements).
> >>
> >> My suggestion was to use the sample (since it was small) as a side
> >> input and then reprocess the collection to filter its elements;
> >> however, I wonder if this is the ‘best’ solution.
> >>
> >> I was also wondering: if Combine is essentially GbK + ParDo, why don’t
> >> we have a Combine function with multiple outputs (maybe an evolution of
> >> CombineWithContext)? I know this sounds weird, and I have probably not
> >> thought enough about the issues or the performance of the translation,
> >> but I wanted to see what others think. Does this make sense? Do you see
> >> any pros/cons or other ideas?
> >>
> >> Thanks,
> >> Ismaël
>
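
The side-input idea from the quoted messages, including Robert's caveat
about duplicates, can be sketched in plain Python (not Beam code). The
sketch assumes every element can be given a unique id; here that is just
its position in a local list, which a distributed pipeline would have to
replace with some other unique key. random.sample stands in for the Sample
transform, and the function name sample_and_rest is hypothetical.

```python
import random


def sample_and_rest(elements, n, rng=None):
    """Return (sample, rest) where rest keeps duplicates that were not sampled."""
    rng = rng or random.Random()
    # Uniquify: pair each element with an id so equal values stay distinguishable.
    indexed = list(enumerate(elements))
    # Stand-in for Sample: uniform sample of size n without replacement.
    sampled = rng.sample(indexed, min(n, len(indexed)))
    # The small set of sampled ids plays the role of the side input.
    sampled_ids = {idx for idx, _ in sampled}
    # Second pass over the collection: keep only elements whose id was not sampled.
    rest = [value for idx, value in indexed if idx not in sampled_ids]
    return [value for _, value in sampled], rest


data = ["a", "a", "b", "a", "c"]
sample, rest = sample_and_rest(data, n=2, rng=random.Random(1))
```

Filtering on the id rather than on the value is what removes only the
"right" copies: if "a" is sampled once, the other two "a" elements still
end up in the rest.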

Re: Combine with multiple outputs case Sample and the rest

Posted by Etienne Chauchot <ec...@apache.org>.
Hi all,

Regarding leveraging the ParDo part of Combine (Combine <=> GBK + ParDo)
to have multiple outputs, please note that most of the time runners
translate Combine to a native (destination-tech) Combine and
not to a GBK + ParDo.
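
One pattern from earlier in the thread that does not depend on how the
runner translates Combine is Robert's: let the combiner produce a single
list of <tag, result> pairs and route them in a separate step afterwards.
A rough plain-Python illustration (not Beam code) follows; combine_tagged
and route are hypothetical names, and the tags "sample" and "rest" are
assumptions chosen for this use case.

```python
import random
from collections import defaultdict


def combine_tagged(elements, n, rng=None):
    """Stand-in for a CombineFn whose single output is a list of (tag, value)."""
    rng = rng or random.Random()
    elements = list(elements)  # the whole input is held in memory here
    chosen = set(rng.sample(range(len(elements)), min(n, len(elements))))
    return [("sample" if i in chosen else "rest", value)
            for i, value in enumerate(elements)]


def route(tagged):
    """Stand-in for the follow-up DoFn that emits each value to its tagged output."""
    outputs = defaultdict(list)
    for tag, value in tagged:
        outputs[tag].append(value)
    return dict(outputs)


outputs = route(combine_tagged(range(10), n=3, rng=random.Random(0)))
```

The sketch also makes the memory concern visible: the combined list holds
every element, sampled or not, in one place.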

Regarding using the stateful DoFn, I agree with Kenn, with the small
caveat that stateful DoFn is not supported in streaming mode with the
Spark runner.

But I guess, Ismaël, that the use case is batch mode.

Best

Etienne
