Posted to dev@beam.apache.org by Thomas Weise <th...@apache.org> on 2018/10/17 15:45:19 UTC

Python SDK worker / portable Flink runner performance improvements

Hi,

As you may have noticed, some of the contributors are working on enabling
Python support on Flink. The upcoming 2.8 release is going to include
much of the functionality and we are now shifting gears to stability and
performance.

Some basic fixes have already landed (logging, a memory leak), but at this
point we still see very low throughput in streaming mode. Improvements are
in flight:

https://issues.apache.org/jira/browse/BEAM-5760
https://issues.apache.org/jira/browse/BEAM-5521

There has been discussion and preliminary work to improve support for
testing as well (streaming mode). The Python SDK currently doesn't have any
(open source) streaming connectors, but we have added a Flink native
transform that can be used for testing:

https://issues.apache.org/jira/browse/BEAM-5707

I'm starting this thread here so that it is easier for more folks to get
involved and stay in sync.

Thanks,
Thomas

Re: Python SDK worker / portable Flink runner performance improvements

Posted by Thomas Weise <th...@apache.org>.
Regarding the functionality:

https://s.apache.org/apache-beam-portability-support-table

While we still have a good chunk of work to do, the MVP feature set is in
place and allows pipelines to run.

Before we check P2 (feature complete), I would like to see (in addition to
what Max mentioned):

* Support for metrics (user-defined and SDK worker; the SDK worker is
currently a black box): It should be possible to get both of these as Flink
metrics to support existing observability infrastructure.
* Support for streaming connectors, at least in the Python SDK

The support table should reflect that (some rows and JIRAs are currently
missing).

Thomas


On Fri, Oct 19, 2018 at 7:11 AM Kenneth Knowles <ke...@apache.org> wrote:

> This is really cool news. Pretty awesome to move from the "get it to run"
> phase to the "get it to run faster" phase of this project.
>
> Streaming testing: In Java there's a synthetic source (GenerateSequence /
> CountingSource) for testing. Maybe in this case I'd say porting to py is
> worth it?
>
> Kenn

Re: Python SDK worker / portable Flink runner performance improvements

Posted by Kenneth Knowles <ke...@apache.org>.
This is really cool news. Pretty awesome to move from the "get it to run"
phase to the "get it to run faster" phase of this project.

Streaming testing: In Java there's a synthetic source (GenerateSequence /
CountingSource) for testing. Maybe in this case I'd say porting to py is
worth it?
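
To make the idea concrete, here is a minimal sketch of what such a synthetic
source does, written in plain Python rather than against the Beam API. The
names generate_sequence and measure_throughput are hypothetical and only
illustrate the behavior (a bounded or unbounded counter with an optional rate
limit); this is not how the Java GenerateSequence is implemented.

```python
import itertools
import time
from typing import Iterable, Iterator, Optional, Tuple


def generate_sequence(start: int = 0,
                      stop: Optional[int] = None,
                      elements_per_second: Optional[float] = None,
                      clock=time.time,
                      sleep=time.sleep) -> Iterator[Tuple[float, int]]:
    """Yield (timestamp, value) pairs for start, start + 1, ...

    Hypothetical helper, not a Beam API. The sequence is unbounded when
    stop is None and is throttled when elements_per_second is set.
    """
    interval = 1.0 / elements_per_second if elements_per_second else 0.0
    values = itertools.count(start) if stop is None else range(start, stop)
    for value in values:
        yield clock(), value
        if interval:
            # Pause after each element to approximate the requested rate.
            sleep(interval)


def measure_throughput(elements: Iterable, clock=time.perf_counter) -> float:
    """Consume the iterable and return the observed elements per second."""
    begin = clock()
    count = sum(1 for _ in elements)
    elapsed = clock() - begin
    return count / elapsed if elapsed > 0 else float("inf")
```

Something like measure_throughput(generate_sequence(0, 100000)) would give a
baseline for the generator itself, which could then be compared against the
rate observed at the end of a pipeline.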

Kenn

On Wed, Oct 17, 2018 at 2:00 PM Lukasz Cwik <lc...@google.com> wrote:

> Thanks, this was useful for me since I have been away these past couple of
> weeks.

Re: Python SDK worker / portable Flink runner performance improvements

Posted by Maximilian Michels <mx...@apache.org>.
Thanks Thomas, I think it is important to start looking at performance 
and improved test coverage.

While we have the basic functionality, state and timers still need to be
implemented for the Portable FlinkRunner. These two will allow
full testing/optimization:

State:  https://issues.apache.org/jira/browse/BEAM-2918 (pending PR)
Timers: https://issues.apache.org/jira/browse/BEAM-4681

-Max

On 17.10.18 22:59, Lukasz Cwik wrote:
> Thanks, this was useful for me since I have been away these past couple
> of weeks.

Re: Python SDK worker / portable Flink runner performance improvements

Posted by Lukasz Cwik <lc...@google.com>.
Thanks, this was useful for me since I have been away these past couple of
weeks.
