Posted to user@beam.apache.org by Ismaël Mejía <ie...@gmail.com> on 2020/12/18 16:19:56 UTC

Combine with multiple outputs case Sample and the rest

I had a question today from one of our users about Beam’s Sample
transform (a Combine with an internal top-like function that produces a
uniform sample of size n of a PCollection). They also wanted to obtain
the rest of the PCollection (the non-sampled elements) as an output.

My suggestion was to use the sample (since it was small) as a side
input and then reprocess the collection to filter out its elements.
However, I wonder if this is the ‘best’ solution.

I was also wondering: if Combine is essentially GbK + ParDo, why don’t
we have a Combine function with multiple outputs (maybe an evolution of
CombineWithContext)? I know this sounds weird, and I have probably not
thought enough about the issues or the performance of the translation,
but I wanted to see what others think. Does this make sense? Do you see
any pros/cons, or have other ideas?

Thanks,
Ismaël

Re: Combine with multiple outputs case Sample and the rest

Posted by Etienne Chauchot <ec...@apache.org>.
Hi all,

Regarding leveraging the ParDo part of Combine (Combine <=> GBK + ParDo)
to have multiple outputs, please note that most of the time the runners
translate Combine to a native (destination-tech) Combine and
not to a GBK + ParDo.

Regarding using the stateful DoFn, I agree with Kenn, with the small
exception that stateful DoFn is not supported in streaming mode with
the Spark runner.

But I guess, Ismaël, that the use case is batch mode.

Best

Etienne

On 05/01/2021 15:00, Kenneth Knowles wrote:
> Perhaps something based on stateful DoFn so there is a simple decision 
> point at which each element is either sampled or not so it can be 
> output to one PCollection or the other. Without doing a little 
> research, I don't recall if this is doable in the way you need.
>
> Kenn

Re: Combine with multiple outputs case Sample and the rest

Posted by Kenneth Knowles <ke...@apache.org>.
Perhaps something based on stateful DoFn so there is a simple decision
point at which each element is either sampled or not so it can be output to
one PCollection or the other. Without doing a little research, I don't
recall if this is doable in the way you need.
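
One way to picture such a decision point is selection sampling. This is a minimal sketch in plain Python, not Beam code, and it rests on an assumption that does not come from the thread: the total number of elements is known before the pass starts (for example from an earlier counting step). The function name split_sample and the counters seen and taken are hypothetical; the counters stand in for the state a stateful DoFn would keep.

```python
import random


def split_sample(elements, total, n, rng=None):
    """Route each element to `sample` or `rest` with one irrevocable decision.

    Assumption: `total` is the exact number of elements, known up front.
    `seen` and `taken` play the role of the per-key state.
    """
    rng = rng or random.Random()
    seen = 0   # elements processed so far
    taken = 0  # elements already routed to the sample
    sample, rest = [], []
    for element in elements:
        # Select with probability (open sample slots) / (elements still to come).
        if rng.random() * (total - seen) < (n - taken):
            sample.append(element)
            taken += 1
        else:
            rest.append(element)
        seen += 1
    return sample, rest
```

With an accurate total, this yields exactly n sampled elements and every subset of size n is equally likely. In Beam, state is scoped per key and window, so a real version would probably have to route all elements through a single key, which limits parallelism.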

Kenn

Re: Combine with multiple outputs case Sample and the rest

Posted by Ismaël Mejía <ie...@gmail.com>.
Thanks for the answer, Robert. Producing a combiner with two lists as
outputs was one idea I was considering too, but I was afraid of
OutOfMemory issues. I had not thought much about the consequences for
combining state, thanks for pointing that out. For the particular sampling
use case it might not be an issue, or am I missing something?

I am still curious whether, for sampling, there could be another approach
that produces the same result (uniform sample + the rest) but without
the issues of combining.

Re: Combine with multiple outputs case Sample and the rest

Posted by Robert Bradshaw <ro...@google.com>.
There are two ways to emit multiple outputs: either to multiple distinct
PCollections (e.g. withOutputTags) or multiple (including 0) outputs to a
single PCollection (the difference between Map and FlatMap). In full
generality, one can always have a CombineFn that outputs lists (say <tag,
result>*) followed by a DoFn that emits to multiple places based on this
result.
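
As a rough illustration of that shape, here is a plain-Python sketch rather than Beam code. The names combine_stats and route are hypothetical: the first stands in for a CombineFn whose single output is a list of (tag, result) pairs, the second for the DoFn that fans those pairs out by tag.

```python
from collections import defaultdict


def combine_stats(values):
    """Stand-in for a CombineFn: one output, a list of (tag, result) pairs."""
    values = list(values)
    return [("min", min(values)), ("max", max(values)), ("count", len(values))]


def route(tagged_results):
    """Stand-in for the follow-up DoFn: send each result to the output named by its tag."""
    outputs = defaultdict(list)
    for tag, result in tagged_results:
        outputs[tag].append(result)
    return dict(outputs)


outputs = route(combine_stats([3, 1, 2]))
```

Each key of outputs corresponds to what would be a separate tagged PCollection. The combined list still has to fit in memory as a single value, so this only suits results that stay small.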

One other con of emitting multiple values from a CombineFn is that CombineFns
are used in other contexts as well, e.g. combining state, and trying to
make sense of a multi-outputting CombineFn in that context is trickier.

Note that for Sample in particular, it works as a CombineFn because we
throw most of the data away. If we kept most of the data, it likely
wouldn't fit on one machine to do the final sampling. The idea of using a
side input to filter after the fact should work well (unless there are
duplicate elements, in which case you'd have to uniquify them somehow
to filter out only the "right" copies).
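
The filter-after-the-fact idea, including the uniquifying step, could look roughly like this. It is a plain-Python sketch, not Beam code: sample_and_rest is a hypothetical name, rng.sample stands in for the Sample transform, the set of ids stands in for the side input, and enumerate is an assumed way to assign unique ids (a distributed pipeline would need something else, such as a random unique id per element).

```python
import random


def sample_and_rest(elements, n, rng=None):
    """Return (sample, rest), keeping duplicate values distinguishable by id."""
    rng = rng or random.Random()
    # Pair every element with a unique id so equal values can be told apart.
    indexed = list(enumerate(elements))
    # Stand-in for Sample: a uniform sample of n (id, value) pairs.
    sampled = rng.sample(indexed, min(n, len(indexed)))
    # Stand-in for the side input: the small set of sampled ids.
    sampled_ids = {idx for idx, _ in sampled}
    # Second pass over the full collection, dropping exactly the sampled copies.
    rest = [value for idx, value in indexed if idx not in sampled_ids]
    return [value for _, value in sampled], rest
```

Filtering by id rather than by value is what keeps the "right" copies: with input ["a", "a", "b"] and a sample containing one "a", the other "a" still ends up in the rest.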

- Robert


