Posted to dev@beam.apache.org by Eugene Kirpichov <ki...@google.com> on 2018/06/26 00:12:52 UTC

Re: Unbounded source translation for portable pipelines

Hi!

Wanted to let you know that I've just merged the PR that adds
checkpointable SDF support to the portable reference runner (ULR) and the
Java SDK harness:

https://github.com/apache/beam/pull/5566

So now we have a reference implementation of SDF support in a portable
runner, and a reference implementation of SDF support in a portable SDK
harness.
From here on, we need to replicate this support in other portable runners
and other harnesses. The obvious targets are Flink and Python respectively.

Chamikara was going to work on the Python harness. +Thomas Weise
<th...@apache.org> Would you be interested in the Flink portable streaming
runner side? It is of course blocked on getting the rest of that runner
working in streaming mode (batch mode is practically done; I will send you
a separate note about its status).
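
For anyone picking this up in another runner or harness, here is a rough toy
model of what "checkpointable" means here. It is plain Python and not the Beam
or Fn API: OffsetRestriction, CountingSource, process_with_checkpoint and
run_bundles are all hypothetical names made up for illustration. The idea is
that the SDF processes part of an unbounded restriction, then hands back a
residual restriction that the runner reschedules later.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple


@dataclass(frozen=True)
class OffsetRestriction:
    """Hypothetical restriction: offsets in [start, end); end=None is unbounded."""
    start: int
    end: Optional[int] = None


class CountingSource:
    """Toy stand-in for an unbounded source: the record at offset i is i * i."""

    def read_from(self, offset: int) -> Iterator[Tuple[int, int]]:
        while True:
            yield offset, offset * offset
            offset += 1


def process_with_checkpoint(
    source: CountingSource, restriction: OffsetRestriction, max_records: int
) -> Tuple[List[int], Optional[OffsetRestriction]]:
    """Process at most max_records records, then checkpoint.

    Returns (outputs, residual). The residual describes the unprocessed work,
    or is None when a bounded restriction has been fully consumed.
    """
    outputs: List[int] = []
    for offset, record in source.read_from(restriction.start):
        if restriction.end is not None and offset >= restriction.end:
            return outputs, None  # bounded restriction is done
        if len(outputs) >= max_records:
            # Checkpoint: everything from this offset on is left for later.
            return outputs, OffsetRestriction(offset, restriction.end)
        outputs.append(record)
    return outputs, None


def run_bundles(
    source: CountingSource, num_bundles: int, records_per_bundle: int
) -> Tuple[List[int], Optional[OffsetRestriction]]:
    """Toy 'runner': seeds one initial restriction and resumes each residual."""
    restriction: Optional[OffsetRestriction] = OffsetRestriction(0)
    all_outputs: List[int] = []
    for _ in range(num_bundles):
        if restriction is None:
            break
        outputs, restriction = process_with_checkpoint(
            source, restriction, records_per_bundle
        )
        all_outputs.extend(outputs)
    return all_outputs, restriction
```

In this sketch the runner's only job is to remember the residual and call the
SDF again with it; a real runner would additionally persist the residual
durably and deal with watermarks, which the toy ignores.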

On Fri, Mar 23, 2018 at 12:20 PM Eugene Kirpichov <ki...@google.com>
wrote:

> Luke is right - unbounded sources should go through SDF. I am currently
> working on adding such support to Fn API.
> The relevant document is s.apache.org/beam-breaking-fusion (note: it
> focuses on a much more general case, but also considers in detail the
> specific case of running unbounded sources on Fn API), and the first
> related PR is https://github.com/apache/beam/pull/4743 .
>
> Ways you can help speed up this effort:
> - Make the necessary changes to the Apex runner itself to support regular
> SDFs in streaming (without portability). They will likely largely carry
> over to the portable world. I recall that the Apex runner had some level of
> support for SDFs, but didn't pass the ValidatesRunner tests yet.
> - (general to Beam, not Apex-related per se) Implement the translation of
> Read.from(UnboundedSource) via impulse, which will require implementing an
> SDF that reads from a given UnboundedSource (taking the UnboundedSource as
> an element). This should be fairly straightforward and will allow all
> portable runners to take advantage of existing UnboundedSources.
>
>
> On Fri, Mar 23, 2018 at 3:08 PM Lukasz Cwik <lc...@google.com> wrote:
>
>> Using impulse is a precursor for both bounded and unbounded SDF.
>>
>> This JIRA tracks the work needed to add support for unbounded SDF using
>> the portability APIs:
>> https://issues.apache.org/jira/browse/BEAM-2939
>>
>>
>> On Fri, Mar 23, 2018 at 11:46 AM Thomas Weise <th...@apache.org> wrote:
>>
>>> So for streaming, we will need the Impulse translation for bounded
>>> input, identical to batch, and in addition to that, support for SDF?
>>>
>>> Any pointers on what's involved in adding the SDF support? Is it runner
>>> specific? Does the ULR cover it?
>>>
>>>
>>> On Fri, Mar 23, 2018 at 11:26 AM, Lukasz Cwik <lc...@google.com> wrote:
>>>
>>>> All "sources" in portability will use splittable DoFns for execution.
>>>>
>>>> Specifically, runners will need to be able to checkpoint unbounded
>>>> sources to get a minimum viable pipeline working.
>>>> For bounded pipelines, a DoFn can read the contents of a bounded source.
>>>>
>>>>
>>>> On Fri, Mar 23, 2018 at 10:52 AM Thomas Weise <th...@apache.org> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm looking at the portable pipeline translation for streaming. I
>>>>> understand that for batch pipelines, it is sufficient to translate Impulse.
>>>>>
>>>>> What is the intended path to support unbounded sources?
>>>>>
>>>>> The goal here is to get a minimum translation working that will allow
>>>>> streaming wordcount execution.
>>>>>
>>>>> Thanks,
>>>>> Thomas
>>>>>
>>>>>
>>>