Posted to dev@beam.apache.org by Brian Hulette <bh...@google.com> on 2021/12/15 17:59:22 UTC

[PROPOSAL] Batched DoFns in the Python SDK

Hi all,

I've drafted a proposal to add a "Batched DoFn" concept to the Beam Python
SDK. The primary motivation is to make it more natural to write vectorized
Beam pipelines by allowing DoFns to produce and/or consume batches of
logical elements. You can find the proposal here:

https://s.apache.org/batched-dofns

Please take a look and let me know what you think.

Thanks!
Brian
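To make the idea concrete, here is a minimal, dependency-free sketch of the concept (the class, method names, and runner are hypothetical illustrations, not the proposed Beam API): a DoFn-like class whose process_batch method is invoked once per batch of logical elements rather than once per element, so that with a library like NumPy the per-element loop could become a single vectorized operation.

```python
from typing import Iterable, Iterator, List

class BatchedDoubleFn:
    """Hypothetical batched DoFn: consumes and produces whole batches
    of logical elements instead of one element per call."""

    def process_batch(self, batch: List[int]) -> Iterator[List[int]]:
        # One call handles the whole batch; with NumPy this loop would
        # collapse into a single vectorized array operation.
        yield [x * 2 for x in batch]

def run(elements: Iterable[int], fn: BatchedDoubleFn,
        batch_size: int = 3) -> List[int]:
    """Slice the element stream into batches and feed each batch
    to process_batch (a stand-in for what a runner would do)."""
    out: List[int] = []
    buf: List[int] = []
    for e in elements:
        buf.append(e)
        if len(buf) == batch_size:
            for produced in fn.process_batch(buf):
                out.extend(produced)
            buf = []
    if buf:  # flush the final partial batch
        for produced in fn.process_batch(buf):
            out.extend(produced)
    return out
```

For example, run(range(5), BatchedDoubleFn()) batches the five elements into [0, 1, 2] and [3, 4] and doubles each batch in a single call.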

Re: [PROPOSAL] Batched DoFns in the Python SDK

Posted by Brian Hulette <bh...@google.com>.
No, there aren't any significant outstanding issues; I just wanted to make
sure the community had a chance to review this before I start working on an
implementation next week.


On Thu, Jan 20, 2022 at 1:58 PM Robert Bradshaw <ro...@google.com> wrote:

> Thanks for the ping. As you saw, I left plenty of comments in the doc.
> Are there any outstanding issues you'd particularly like the
> community's feedback on?
>
> On Wed, Jan 19, 2022 at 11:59 AM Chad Dombrova <ch...@gmail.com> wrote:
> >>
> >> Thanks Chad, I'll take a look at your talk and design to see if there
> are any ideas we can merge.
> >
> >
> > Thanks Brian.  My hope is that even if you don't add the complete
> scheduling framework, we'll get all the features and hooks we need to build
> our toolset without needing to modify Beam code (which is technical debt
> that I'd rather not have).  Then we can offer our tool on PyPI for those
> who want it.  I'm happy to have a call with you and the Beam team to
> discuss the details once you've had a look.
> >
> > -chad
> >
>

Re: [PROPOSAL] Batched DoFns in the Python SDK

Posted by Robert Bradshaw <ro...@google.com>.
Thanks for the ping. As you saw, I left plenty of comments in the doc.
Are there any outstanding issues you'd particularly like the
community's feedback on?

On Wed, Jan 19, 2022 at 11:59 AM Chad Dombrova <ch...@gmail.com> wrote:
>>
>> Thanks Chad, I'll take a look at your talk and design to see if there are any ideas we can merge.
>
>
> Thanks Brian.  My hope is that even if you don't add the complete scheduling framework, we'll get all the features and hooks we need to build our toolset without needing to modify Beam code (which is technical debt that I'd rather not have).  Then we can offer our tool on PyPI for those who want it.  I'm happy to have a call with you and the Beam team to discuss the details once you've had a look.
>
> -chad
>

Re: [PROPOSAL] Batched DoFns in the Python SDK

Posted by Chad Dombrova <ch...@gmail.com>.
>
> Thanks Chad, I'll take a look at your talk and design to see if there are
> any ideas we can merge.
>

Thanks Brian.  My hope is that even if you don't add the complete
scheduling framework, we'll get all the features and hooks we need to build
our toolset without needing to modify Beam code (which is technical debt
that I'd rather not have).  Then we can offer our tool on PyPI for those
who want it.  I'm happy to have a call with you and the Beam team to
discuss the details once you've had a look.

-chad

Re: [PROPOSAL] Batched DoFns in the Python SDK

Posted by Brian Hulette <bh...@google.com>.
Thanks Chad, I'll take a look at your talk and design to see if there are
any ideas we can merge.

I also just wanted to ping this thread once more - I know it went out just
as many people were taking off for the holidays, so it might have gotten lost.
The design doc is at https://s.apache.org/batched-dofns.

On Fri, Dec 17, 2021 at 8:57 AM Chad Dombrova <ch...@gmail.com> wrote:

> Hi Brian,
> We implemented a feature that's similar to this, but with a different
> motivation: scheduled tasks.  We had the same need of creating batches of
> logical elements, but rather than perform SIMD-optimized computations, we
> want to produce remotely scheduled tasks.  It's my hope that the underlying
> job of slicing a stream into batches of elements can achieve both goals.
> We've been using this change in production for a while now, but I would
> really love to get this onto master.  Can you have a look and see how it
> compares to what you have in mind?
>
> Here's the info I sent out about this earlier:
>
> Beam's niche is low-latency, high-throughput workloads, but Beam has
> incredible promise as an orchestrator of long-running work that gets sent
> to a scheduler.  We've created a modified version of Beam that allows the
> Python SDK worker to outsource tasks to a scheduler, like Kubernetes batch
> jobs[1], Argo[2], or Google's own OpenCue[3].
>
> The basic idea is that any element in a stream can be tagged to be
> executed outside of the normal SdkWorker as an atomic "task".  A task is
> one invocation of a stage, composed of one or more DoFns, against a slice
> of the data stream, composed of one or more tagged elements.   The upshot
> is that we're able to slice up the processing of a stream across
> potentially *many* workers, with the trade-off being the added overhead
> of starting up a worker process for each task.
>
> For more info on how we use our modified version of Beam to make visual
> effects for feature films, check out the talk[4] I gave at the Beam Summit.
>
> Here's our design doc:
>
> https://docs.google.com/document/d/1GrAvDWwnR1QAmFX7lnNA7I_mQBC2G1V2jE2CZOc6rlw/edit?usp=sharing
>
> And here's the GitHub branch:
> https://github.com/LumaPictures/beam/tree/taskworker_public
>
>
> [1] https://kubernetes.io/docs/concepts/workloads/controllers/job/
> [2] https://argoproj.github.io/
> [3] https://cloud.google.com/opencue
> [4] https://www.youtube.com/watch?v=gvbQI3I03a8&ab_channel=ApacheBeam
>
> -chad
>
>
> On Wed, Dec 15, 2021 at 9:59 AM Brian Hulette <bh...@google.com> wrote:
>
>> Hi all,
>>
>> I've drafted a proposal to add a "Batched DoFn" concept to the Beam
>> Python SDK. The primary motivation is to make it more natural to write
>> vectorized Beam pipelines by allowing DoFns to produce and/or consume
>> batches of logical elements. You can find the proposal here:
>>
>> https://s.apache.org/batched-dofns
>>
>> Please take a look and let me know what you think.
>>
>> Thanks!
>> Brian
>>
>

Re: [PROPOSAL] Batched DoFns in the Python SDK

Posted by Chad Dombrova <ch...@gmail.com>.
Hi Brian,
We implemented a feature that's similar to this, but with a different
motivation: scheduled tasks.  We had the same need of creating batches of
logical elements, but rather than perform SIMD-optimized computations, we
want to produce remotely scheduled tasks.  It's my hope that the underlying
job of slicing a stream into batches of elements can achieve both goals.
We've been using this change in production for a while now, but I would
really love to get this onto master.  Can you have a look and see how it
compares to what you have in mind?

Here's the info I sent out about this earlier:

Beam's niche is low-latency, high-throughput workloads, but Beam has
incredible promise as an orchestrator of long-running work that gets sent
to a scheduler.  We've created a modified version of Beam that allows the
Python SDK worker to outsource tasks to a scheduler, like Kubernetes batch
jobs[1], Argo[2], or Google's own OpenCue[3].

The basic idea is that any element in a stream can be tagged to be executed
outside of the normal SdkWorker as an atomic "task".  A task is one
invocation of a stage, composed of one or more DoFns, against a slice of
the data stream, composed of one or more tagged elements.   The upshot is
that we're able to slice up the processing of a stream across potentially
*many* workers, with the trade-off being the added overhead of starting up
a worker process for each task.
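The slicing idea can be illustrated with a small, self-contained sketch (the function name, the tagging mechanism, and the slice size below are hypothetical illustrations, not the API from our branch): untagged elements stay with the normal SdkWorker, while tagged elements are grouped into slices, each of which would become one atomic task handed to an external scheduler.

```python
from typing import Any, Iterable, List, Tuple

def slice_into_tasks(
    stream: Iterable[Tuple[Any, bool]], max_slice: int = 2
) -> Tuple[List[Any], List[List[Any]]]:
    """Partition (element, tagged) pairs from a stream.

    Untagged elements are returned for in-line processing by the
    normal SDK worker; tagged elements are grouped into slices of at
    most max_slice elements, each slice representing one atomic
    "task" to be run by a remote worker process."""
    local: List[Any] = []        # processed by the normal SdkWorker
    tasks: List[List[Any]] = []  # each inner list is one remote task
    current: List[Any] = []
    for elem, tagged in stream:
        if not tagged:
            local.append(elem)
            continue
        current.append(elem)
        if len(current) == max_slice:
            tasks.append(current)
            current = []
    if current:  # flush the final partial slice
        tasks.append(current)
    return local, tasks
```

With this sketch, a stream of four elements where the last three are tagged yields one locally processed element plus two task slices, mirroring the trade-off described above: more parallelism across workers at the cost of one worker startup per slice.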

For more info on how we use our modified version of Beam to make visual
effects for feature films, check out the talk[4] I gave at the Beam Summit.

Here's our design doc:
https://docs.google.com/document/d/1GrAvDWwnR1QAmFX7lnNA7I_mQBC2G1V2jE2CZOc6rlw/edit?usp=sharing

And here's the GitHub branch:
https://github.com/LumaPictures/beam/tree/taskworker_public


[1] https://kubernetes.io/docs/concepts/workloads/controllers/job/
[2] https://argoproj.github.io/
[3] https://cloud.google.com/opencue
[4] https://www.youtube.com/watch?v=gvbQI3I03a8&ab_channel=ApacheBeam

-chad


On Wed, Dec 15, 2021 at 9:59 AM Brian Hulette <bh...@google.com> wrote:

> Hi all,
>
> I've drafted a proposal to add a "Batched DoFn" concept to the Beam Python
> SDK. The primary motivation is to make it more natural to write vectorized
> Beam pipelines by allowing DoFns to produce and/or consume batches of
> logical elements. You can find the proposal here:
>
> https://s.apache.org/batched-dofns
>
> Please take a look and let me know what you think.
>
> Thanks!
> Brian
>