Posted to dev@beam.apache.org by Thomas Weise <th...@apache.org> on 2018/04/03 01:17:09 UTC

Python SDK feature set

Hi,

I’m trying to find a summary of the feature set that is currently supported
in the Python SDK. I understand it is experimental and currently only
supports a subset of the Beam model like fixed interval windows but not
merging windows and custom window functions.

I’m specifically interested in stateful processing and timers as basic
building blocks that could be used with a global window to implement
session functionality and other higher level abstractions without being
constrained by window functions and predefined triggers.

Also since we have the runner capability matrix
<https://beam.apache.org/documentation/runners/capability-matrix/>, would
it be useful to track SDK capabilities in a similar way so that users know
what’s supported?

Thanks,
Thomas

Re: Python SDK feature set

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.

Hi Thomas,

I think work on timers has started. I'm sure the Python guys will provide an update.

Regarding the SDK capability "matrix", it's a good idea, +1.

Regards
JB


-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: Python SDK feature set

Posted by Thomas Weise <th...@apache.org>.

Hi Robert,

Could you please clarify how session / merging windows are supported for
Python pipelines? The SDK has them, but which runners support them, and what
limitations are there, if any?

Thanks,
Thomas


On Tue, Apr 3, 2018 at 10:41 AM, Robert Bradshaw <ro...@google.com>
wrote:

> On Tue, Apr 3, 2018 at 10:26 AM Thomas Weise <th...@apache.org> wrote:
>
>>
>> On Mon, Apr 2, 2018 at 9:55 PM, Ahmet Altay <al...@google.com> wrote:
>>
>>>
>>>> I’m specifically interested in stateful processing and timers as basic
>>>> building blocks that could be used with a global window to implement
>>>> session functionality and other higher level abstractions without being
>>>> constrained by window functions and predefined triggers.
>>>>
>>>
> While we certainly want to fill this in, I am curious what processing
> you're doing that's not satisfied by the standard window functions
> (including sessions) and triggers.
>
>
>>
>>> These are missing. Charles started working on those [1]. We can use any
>>> help we can get if you are interested.
>>>
>>
>> Are there any existing discussions or docs for BEAM-2687
>> <https://issues.apache.org/jira/browse/BEAM-2687>?
>>
>
> There are some sketches at
> https://docs.google.com/document/d/1ClmQ6LqdnfseRzeSw3SL68DAO1f8jsWBL2FfzWErlbw/edit#heading=h.pv99fae1rece
> , but it would be worth fleshing this out into a full proposal.
>
>
>>
>>
>>>
>>>>
>>>> Also since we have the runner capability matrix
>>>> <https://beam.apache.org/documentation/runners/capability-matrix/>,
>>>> would it be useful to track SDK capabilities in a similar way so that
>>>> users know what’s supported?
>>>>
>>>
>>> I think this would be really helpful. Especially now that we have
>>> multiple SDKs in master. Recently Rafael proposed having per-transform
>>> documentation [2]. Building an SDK capability matrix would be a natural
>>> extension of it.
>>>
>>
>> To have some basic orientation, it might be good to add the Python SDK to
>> the existing matrix? First column currently is "Beam Model", should that
>> become Java SDK instead?
>>
>
> Here again it depends on the runner. By this matrix, the Python SDK has
> the same coverage as "Beam Model" except for "Stateful Processing" and a ~
> for Side Inputs (batch support, streaming is a work in progress) for
> local/dataflow. Both are less complete for Portability-based runners.
>
>
>

Re: Python SDK feature set

Posted by Thomas Weise <th...@apache.org>.

On Tue, Apr 3, 2018 at 10:41 AM, Robert Bradshaw <ro...@google.com>
wrote:

> On Tue, Apr 3, 2018 at 10:26 AM Thomas Weise <th...@apache.org> wrote:
>
>>
>> On Mon, Apr 2, 2018 at 9:55 PM, Ahmet Altay <al...@google.com> wrote:
>>
>>>
>>>> I’m specifically interested in stateful processing and timers as basic
>>>> building blocks that could be used with a global window to implement
>>>> session functionality and other higher level abstractions without being
>>>> constrained by window functions and predefined triggers.
>>>>
>>>
> While we certainly want to fill this in, I am curious what processing
> you're doing that's not satisfied by the standard window functions
> (including sessions) and triggers.
>

To answer why we would want stateful processing, Kenn's blog would be the
first stop:
https://beam.apache.org/blog/2017/02/13/stateful-processing.html

But I would flip it around and ask: why would we not start with the lower
level building block that gives us more control and flexibility? The use
cases we are considering are not all simple aggregations that can be handled
with combiner/groupByKey (and potentially also implemented with SQL or a
DSL at a similar level of abstraction).

We are evaluating platform support for ultimately more complex pipelines
that require imperative programming. Capabilities such as updating only
relevant portions of state and emitting potentially multiple outputs based
on UDF logic will be important in that context.
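
To make the idea concrete, here is a minimal plain-Python sketch of session
logic built from per-key state plus a per-key timer, with no window function
involved. It is only a simulation under stated assumptions: it does not use
any Beam API, the names sessionize and GAP are hypothetical, and input is
assumed to arrive in timestamp order with no late data.

```python
from collections import defaultdict

GAP = 10  # hypothetical session gap, in seconds


def sessionize(events, gap=GAP):
    """Group (key, timestamp) events into per-key sessions.

    Simulates per-key state (the buffered timestamps) and a per-key
    event-time timer (the deadline after which the session is emitted).
    Assumes events arrive in timestamp order; late data is not handled.
    """
    buffers = defaultdict(list)  # per-key "bag state"
    deadlines = {}               # per-key "timer": session expiry time
    out = []

    def fire(key):
        # Timer callback: emit the buffered session and clear the state.
        out.append((key, buffers.pop(key)))
        del deadlines[key]

    for key, ts in events:
        # Advance the simulated watermark to ts and fire expired timers.
        for k in [k for k, d in deadlines.items() if d <= ts]:
            fire(k)
        buffers[key].append(ts)
        deadlines[key] = ts + gap  # reset the timer on every new element

    # End of input: fire whatever timers are still pending.
    for k in list(deadlines):
        fire(k)
    return out
```

Because the callback owns both the state and the output, it could just as
well emit several records or keep part of the buffer, which is the kind of
control that predefined triggers do not offer.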


>
>
>>
>>> These are missing. Charles started working on those [1]. We can use any
>>> help we can get if you are interested.
>>>
>>
>> Are there any existing discussions or docs for BEAM-2687
>> <https://issues.apache.org/jira/browse/BEAM-2687>?
>>
>
> There are some sketches at
> https://docs.google.com/document/d/1ClmQ6LqdnfseRzeSw3SL68DAO1f8jsWBL2FfzWErlbw/edit#heading=h.pv99fae1rece
> , but it would be worth fleshing this out into a full proposal.
>
>
>>
>>
>>>
>>>>
>>>> Also since we have the runner capability matrix
>>>> <https://beam.apache.org/documentation/runners/capability-matrix/>,
>>>> would it be useful to track SDK capabilities in a similar way so that
>>>> users know what’s supported?
>>>>
>>>
>>> I think this would be really helpful. Especially now that we have
>>> multiple SDKs in master. Recently Rafael proposed having per-transform
>>> documentation [2]. Building an SDK capability matrix would be a natural
>>> extension of it.
>>>
>>
>> To have some basic orientation, it might be good to add the Python SDK to
>> the existing matrix? First column currently is "Beam Model", should that
>> become Java SDK instead?
>>
>
> Here again it depends on the runner. By this matrix, the Python SDK has
> the same coverage as "Beam Model" except for "Stateful Processing" and a ~
> for Side Inputs (batch support, streaming is a work in progress) for
> local/dataflow. Both are less complete for Portability-based runners.
>

Indeed, my interest here is in portability-based runners. But that was the
stated future for all SDKs!

Thanks,
Thomas

Re: Python SDK feature set

Posted by Robert Bradshaw <ro...@google.com>.

On Tue, Apr 3, 2018 at 10:26 AM Thomas Weise <th...@apache.org> wrote:

>
> On Mon, Apr 2, 2018 at 9:55 PM, Ahmet Altay <al...@google.com> wrote:
>
>>
>>> I’m specifically interested in stateful processing and timers as basic
>>> building blocks that could be used with a global window to implement
>>> session functionality and other higher level abstractions without being
>>> constrained by window functions and predefined triggers.
>>>
>>
While we certainly want to fill this in, I am curious what processing
you're doing that's not satisfied by the standard window functions
(including sessions) and triggers.
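
For readers unfamiliar with session windows, the merging behaviour can be
sketched in a few lines of plain Python. This is an illustration of the
concept only, not the SDK's implementation; merge_sessions is a hypothetical
name. Each timestamp t gets a proto-window [t, t + gap), and overlapping
windows are merged.

```python
def merge_sessions(timestamps, gap):
    """Assign each timestamp the window [t, t + gap) and merge overlaps."""
    merged = []
    for t in sorted(timestamps):
        start, end = t, t + gap
        if merged and start < merged[-1][1]:
            # Overlaps the previous window: extend that window instead.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

With a gap of 10, the timestamps 0, 5 and 12 collapse into one window, while
30 starts a new one.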


>
>> These are missing. Charles started working on those [1]. We can use any
>> help we can get if you are interested.
>>
>
> Are there any existing discussions or docs for BEAM-2687
> <https://issues.apache.org/jira/browse/BEAM-2687>?
>

There are some sketches at
https://docs.google.com/document/d/1ClmQ6LqdnfseRzeSw3SL68DAO1f8jsWBL2FfzWErlbw/edit#heading=h.pv99fae1rece
, but it would be worth fleshing this out into a full proposal.


>
>
>>
>>>
>>> Also since we have the runner capability matrix
>>> <https://beam.apache.org/documentation/runners/capability-matrix/>,
>>> would it be useful to track SDK capabilities in a similar way so that
>>> users know what’s supported?
>>>
>>
>> I think this would be really helpful. Especially now that we have
>> multiple SDKs in master. Recently Rafael proposed having per-transform
>> documentation [2]. Building an SDK capability matrix would be a natural
>> extension of it.
>>
>
> To have some basic orientation, it might be good to add the Python SDK to the
> existing matrix? First column currently is "Beam Model", should that become
> Java SDK instead?
>

Here again it depends on the runner. By this matrix, the Python SDK has the
same coverage as "Beam Model" except for "Stateful Processing" and a ~ for
Side Inputs (batch support, streaming is a work in progress) for
local/dataflow. Both are less complete for Portability-based runners.
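
As a rough sketch of what an SDK column in such a matrix could look like,
the snippet below encodes the two deviations described above for
local/Dataflow. The row names, the all-"yes" baseline, and the function names
are assumptions made for illustration; they are not copied from the published
matrix.

```python
# Support levels; "~" marks partial support.
YES, PARTIAL, NO = "yes", "~", "no"

# Illustrative subset of rows (an assumption, not the full matrix).
CAPABILITIES = ["ParDo", "GroupByKey", "Side Inputs", "Stateful Processing"]


def beam_model_column():
    # Assumed baseline: the model defines every capability listed here.
    return {cap: YES for cap in CAPABILITIES}


def python_sdk_column():
    # Same as the baseline except for the two deviations from this thread.
    column = beam_model_column()
    column["Side Inputs"] = PARTIAL        # batch only, streaming in progress
    column["Stateful Processing"] = NO     # not yet available
    return column


def render(columns, capabilities=CAPABILITIES):
    """Render {column name: {capability: level}} as a plain-text table."""
    names = list(columns)
    width = max(len(cap) for cap in capabilities)
    lines = [" " * width + " | " + " | ".join(names)]
    for cap in capabilities:
        cells = [columns[n][cap].ljust(len(n)) for n in names]
        lines.append(cap.ljust(width) + " | " + " | ".join(cells))
    return "\n".join(lines)
```

Calling render({"Beam Model": beam_model_column(), "Python SDK":
python_sdk_column()}) yields one row per capability; a separate column per
runner family would be needed to capture the portability caveat.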

Re: Python SDK feature set

Posted by Thomas Weise <th...@apache.org>.

On Mon, Apr 2, 2018 at 9:55 PM, Ahmet Altay <al...@google.com> wrote:

>
>> I’m specifically interested in stateful processing and timers as basic
>> building blocks that could be used with a global window to implement
>> session functionality and other higher level abstractions without being
>> constrained by window functions and predefined triggers.
>>
>
> These are missing. Charles started working on those [1]. We can use any
> help we can get if you are interested.
>


Are there any existing discussions or docs for BEAM-2687
<https://issues.apache.org/jira/browse/BEAM-2687>?


>
>>
>> Also since we have the runner capability matrix
>> <https://beam.apache.org/documentation/runners/capability-matrix/>,
>> would it be useful to track SDK capabilities in a similar way so that
>> users know what’s supported?
>>
>
> I think this would be really helpful. Especially now that we have multiple
> SDKs in master. Recently Rafael proposed having per-transform documentation
> [2]. Building an SDK capability matrix would be a natural extension of it.
>

To have some basic orientation, it might be good to add the Python SDK to the
existing matrix? First column currently is "Beam Model", should that become
Java SDK instead?


>
> [1] https://issues.apache.org/jira/browse/BEAM-2687
> [2]
> https://lists.apache.org/thread.html/626427f65a97eb5b40ecb4963202e7bdb43fccf139d82add698a7113@%3Cdev.beam.apache.org%3E
>
>
>
>>
>> Thanks,
>> Thomas
>>
>>
>

Re: Python SDK feature set

Posted by Ahmet Altay <al...@google.com>.

On Mon, Apr 2, 2018 at 6:17 PM, Thomas Weise <th...@apache.org> wrote:

> Hi,
>
> I’m trying to find a summary of the feature set that is currently
> supported in the Python SDK. I understand it is experimental and
> currently only supports a subset of the Beam model like fixed interval
> windows but not merging windows and custom window functions.
>
> I’m specifically interested in stateful processing and timers as basic
> building blocks that could be used with a global window to implement
> session functionality and other higher level abstractions without being
> constrained by window functions and predefined triggers.
>

These are missing. Charles started working on those [1]. We can use any
help we can get if you are interested.


>
> Also since we have the runner capability matrix
> <https://beam.apache.org/documentation/runners/capability-matrix/>, would
> it be useful to track SDK capabilities in a similar way so that users
> know what’s supported?
>

I think this would be really helpful. Especially now that we have multiple
SDKs in master. Recently Rafael proposed having per-transform documentation
[2]. Building an SDK capability matrix would be a natural extension of it.

[1] https://issues.apache.org/jira/browse/BEAM-2687
[2]
https://lists.apache.org/thread.html/626427f65a97eb5b40ecb4963202e7bdb43fccf139d82add698a7113@%3Cdev.beam.apache.org%3E



>
> Thanks,
> Thomas
>
>