Posted to dev@beam.apache.org by Ahmet Altay <al...@google.com.INVALID> on 2017/01/17 16:22:17 UTC

[DISCUSS] Python SDK status and next steps

Hi all,

tl;dr: I would like to start a discussion about merging the python-sdk branch
into the master branch. The Python SDK is mature enough, and merging it into
master will accelerate its development and adoption.

With a great effort from a lot of contributors(*), Python SDK [1] is now a
mostly complete, tested, performant Python implementation of the Beam
model. Since June, when we first started with Python SDK in Apache Beam, we
have been continuously improving it.

** Python SDK currently supports:

* Model: All main concepts are present (ParDo, GroupByKey, Windowing etc.).
* IO: There are extensible APIs for writing new bounded sources and sinks.
Implementations are provided for Text, Avro, BigQuery, and Datastore.
* Runners: Python SDK has an extensible base runner module that allows
building specific runners on top of it. The SDK comes with two pipeline
runners: DirectRunner and DataflowRunner; and it is possible to add more.
The existing runners are currently limited to bounded execution and
otherwise equivalent to their Java SDK counterparts in functionality.
* Testing: Python SDK implements the ValidatesRunner test framework for
implementing integration tests for current and future runners. There is unit
test coverage for all modules, and a number of integration tests for
validating existing runners.
* Documentation and examples: Documentation work has started on Python SDK.
The Beam Programming Guide page has been updated to include Python [2]. The
code comes with many ready-to-use examples and we are in a good place to
start documenting those on the website.
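
For readers new to the model, the concepts named above (ParDo, GroupByKey,
Windowing) can be illustrated with a minimal sketch in plain Python. This is
not the Beam API: the helper names (par_do, fixed_window,
group_by_key_and_window, split_words, count_words) and the sample input are
hypothetical, chosen only to show the shape of a windowed word count over
bounded, in-memory data.

```python
from collections import defaultdict


def par_do(elements, fn):
    """ParDo-like step: apply fn to each element; fn may emit 0..n outputs."""
    for element in elements:
        for output in fn(element):
            yield output


def fixed_window(timestamp, size):
    """Return the [start, end) fixed window of width `size` holding timestamp."""
    start = timestamp - (timestamp % size)
    return (start, start + size)


def group_by_key_and_window(timestamped_kvs, window_size):
    """GroupByKey-like step: collect values per (key, window) pair."""
    groups = defaultdict(list)
    for key, value, timestamp in timestamped_kvs:
        groups[(key, fixed_window(timestamp, window_size))].append(value)
    return dict(groups)


def split_words(record):
    """Hypothetical DoFn-style function: emit (word, 1, timestamp) per word."""
    line, timestamp = record
    for word in line.split():
        yield (word, 1, timestamp)


def count_words(records, window_size=10):
    """Windowed word count over a bounded list of (line, timestamp) records."""
    grouped = group_by_key_and_window(par_do(records, split_words), window_size)
    return {key_window: sum(values) for key_window, values in grouped.items()}
```

Calling count_words([("a b a", 1), ("a", 12)]) counts "a" twice and "b" once
in the window (0, 10), and "a" once in the window (10, 20). A real pipeline
would express the same steps through the SDK's transforms and hand them to a
runner instead of evaluating them eagerly in one process.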

** We are not done yet, next on the roadmap we have:

* Streaming: Both of the existing runners lack support for streaming
execution, and currently there is work going on for adding streaming
support to DirectRunner [3].
* Documentation: Filling in the rest of the Beam documentation with Python
SDK-specific information and examples.
* SDK consistency: Making Python SDK consistent with the Java SDK. We have
come a long way on this and have only a few items left [4].
* Beamifying: We have been working on removing Dataflow-specific references
both from the documentation and from the code. There is some work left, and
we are currently working on that as well [5].

** Steps and implications of merging to master:

* Master branch is merged to python-sdk branch at regular intervals and the
last merge was on 12/22. All the past merges were uneventful because there
is minimal overlap in modified files between branches. Integrating
python-sdk into master will similarly touch a small number of existing files.

* Python SDK is using the same tools for building and testing. It is
already integrated with Maven, Jenkins and Travis. Specifically the impact
to the testing infrastructure would be:
- There will be two additional test configurations in Travis. Since Travis
runs all configurations in parallel there should not be a noticeable change
in the Travis run time.
- Jenkins pre-commit test will start running the Python SDK tests. It will
add 5 minutes to the completion time of the pre-commit test.
Historically, Python SDK tests were not flaky and did not cause any random
failures.
- Jenkins Python post-commit test is already separated from the other
post-commit tests and will continue to exist. It would not change the
testing time for any other test.

* The release process needs to be updated to accommodate releasing Python
artifacts. Python SDK would fit in the existing release schedule and could
be released along with the Java SDK. The additional steps would include:
- Generating Python artifacts. This could be done with a single command
using Maven today.
- Publishing the artifacts to a central repository such as PyPI.
- Updating the release guide to reflect the changes above.

* Users: There are existing users using the Python SDK. To give a rough
estimate, a distribution of the Beam Python SDK had a total of 23K
downloads in the past 6 months [6]. Some of those users are already engaged
with the community (e.g. [7]). There might be an increase in
engagement from the rest of them after the merge.

Looking forward to hearing your thoughts and comments on “graduating”
python-sdk to the master.

Thank you,
Ahmet

(*) Python SDK branch currently has a diverse group of contributors.
Regular contributors include Charles Chen, Chamikara Jayalath, María García
Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam PMC),
Sourabh Bajaj, and Vikas Kedigehalli. We have also had contributions from
Abdullah Bashir, Marco Buccini, Sergio Fernández, Seunghyun Lee, and
Younghee Kwon.

[1] https://github.com/apache/beam/tree/python-sdk/sdks/python
[2] https://beam.apache.org/documentation/programming-guide/
[3] https://issues.apache.org/jira/browse/BEAM-1265
[4] https://issues.apache.org/jira/issues/?jql=status%20%3D%20Open%20AND%20labels%20%3D%20sdk-consistency
[5] https://issues.apache.org/jira/browse/BEAM-1218
[6] https://pypi.python.org/pypi/google-cloud-dataflow/json
[7] https://issues.apache.org/jira/browse/BEAM-1251

Re: [DISCUSS] Python SDK status and next steps

Posted by Ahmet Altay <al...@google.com.INVALID>.
Thank you Dan. Adding support for unbounded data is on the roadmap and it
will be added to Python SDK soon.

Thank you all again, I will start the official voting thread.

Thank you,
Ahmet

On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin <dh...@google.com.invalid>
wrote:

> I do not think that Python SDK yet meets the bar [1] for implementing the
> Beam model -- supporting Unbounded data is very important. That said, given
> the committed and sustained set of contributors, it generally makes sense
> to me to make an exception in anticipation of these features being fleshed
> out soon; including potentially new users/contributors that would arrive
> once in master.
>
> [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Yk0PLmM3f5e5BqwJz4+C5doRucLNxOh8w@mail.gmail.com
>
> On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay <al...@google.com.invalid>
> wrote:
>
> > Thank you all for the comments so far. I would follow the process as
> > suggested by Davor and others in this thread.
> >
> > Ahmet
> >
> > On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández <wi...@apache.org>
> > wrote:
> >
> > > Hi
> > >
> > > On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay <altay@google.com.invalid>
> > > wrote:
> > > >
> > > > tl;dr: I would like to start a discussion about merging the python-sdk
> > > > branch into the master branch. The Python SDK is mature enough, and
> > > > merging it into master will accelerate its development and adoption.
> > > >
> > >
> > > Good point, Ahmet!
> > >
> > > I've been following the development closely since it was imported in
> > > June. For the prototypes I've implemented so far it works quite well; I
> > > guess we'd just need to focus the next months on bringing more runner
> > > support.
> > >
> > > > With a great effort from a lot of contributors(*), Python SDK [1] is
> > > > now a mostly complete, tested, performant Python implementation of the
> > > > Beam model. Since June, when we first started with Python SDK in Apache
> > > > Beam, we have been continuously improving it.
> > > >
> > >
> > > I wouldn't merge during the preparation of the 0.5.0 release, but after
> > > that could be a good time to merge back into master.
> > >
> > >
> > > > ** Python SDK currently supports:
> > > >
> > > > * Model: All main concepts are present (ParDo, GroupByKey, Windowing
> > > > etc.).
> > > > * IO: There are extensible APIs for writing new bounded sources and
> > > > sinks. Implementations are provided for Text, Avro, BigQuery, and
> > > > Datastore.
> > > > * Runners: Python SDK has an extensible base runner module that allows
> > > > building specific runners on top of it. The SDK comes with two pipeline
> > > > runners: DirectRunner and DataflowRunner; and it is possible to add
> > > > more. The existing runners are currently limited to bounded execution
> > > > and otherwise equivalent to their Java SDK counterparts in
> > > > functionality.
> > > >
> > >
> > > What would be the effort of porting and maintaining parallel versions
> > > of the Java runners? I guess I'd need to dig deeper into the model, but
> > > this may represent a major effort for the project, right?
> > >
> >
> > It is somewhat higher for DirectRunner because DirectRunner also
> > implements the code for execution. It is not that high for DataflowRunner
> > because the base runner module has a lot of helpers with the right hooks
> > for implementing a generic runner. I would _expect_ that the experience in
> > general would be similar to the latter.
> >
> >
> > >
> > >
> > >
> > > > * Testing: Python SDK implements the ValidatesRunner test framework for
> > > > implementing integration tests for current and future runners. There is
> > > > unit test coverage for all modules, and a number of integration tests
> > > > for validating existing runners.
> > > > * Documentation and examples: Documentation work has started on Python
> > > > SDK. The Beam Programming Guide page has been updated to include Python
> > > > [2]. The code comes with many ready-to-use examples and we are in a
> > > > good place to start documenting those on the website.
> > > >
> > > > ** We are not done yet, next on the roadmap we have:
> > > >
> > > > * Streaming: Both of the existing runners lack support for streaming
> > > > execution, and currently there is work going on for adding streaming
> > > > support to DirectRunner [3].
> > > > * Documentation: Filling in the rest of the Beam documentation with
> > > > Python SDK-specific information and examples.
> > > > * SDK consistency: Making Python SDK consistent with the Java SDK. We
> > > > have come a long way on this and have only a few items left [4].
> > > > * Beamifying: We have been working on removing Dataflow-specific
> > > > references both from the documentation and from the code. There is some
> > > > work left, and we are currently working on that as well [5].
> > > >
> > > > ** Steps and implications of merging to master:
> > > >
> > > > * Master branch is merged to python-sdk branch at regular intervals
> > > > and the last merge was on 12/22. All the past merges were uneventful
> > > > because there is minimal overlap in modified files between branches.
> > > > Integrating python-sdk into master will similarly touch a small number
> > > > of existing files.
> > > >
> > > > * Python SDK is using the same tools for building and testing. It is
> > > > already integrated with Maven, Jenkins and Travis. Specifically the
> > > > impact to the testing infrastructure would be:
> > > > - There will be two additional test configurations in Travis. Since
> > > > Travis runs all configurations in parallel there should not be a
> > > > noticeable change in the Travis run time.
> > > > - Jenkins pre-commit test will start running the Python SDK tests. It
> > > > will add 5 minutes to the completion time of the pre-commit test.
> > > > Historically, Python SDK tests were not flaky and did not cause any
> > > > random failures.
> > > > - Jenkins Python post-commit test is already separated from the other
> > > > post-commit tests and will continue to exist. It would not change the
> > > > testing time for any other test.
> > > >
> > > > * The release process needs to be updated to accommodate releasing
> > > > Python artifacts. Python SDK would fit in the existing release schedule
> > > > and could be released along with the Java SDK. The additional steps
> > > > would include:
> > > > - Generating Python artifacts. This could be done with a single command
> > > > using Maven today.
> > > > - Publishing the artifacts to a central repository such as PyPI.
> > > >
> > >
> > > I'm more than happy to help with this. We purposely left some things
> > > open when we added Maven support to the Python build.
> > >
> >
> > That would be awesome. We can coordinate on that post-merge.
> >
> >
> > >
> > >
> > >
> > > > - Updating the release guide to reflect the changes above.
> > > >
> > > > * Users: There are existing users using the Python SDK. To give a
> > > > rough estimate, a distribution of the Beam Python SDK had a total of
> > > > 23K downloads in the past 6 months [6]. Some of those users are already
> > > > engaged with the community (e.g. [7]). There might be an increase in
> > > > engagement from the rest of them after the merge.
> > > >
> > >
> > > Python 3 support is something we definitely need to look ahead to. I'd
> > > try to make the codebase compatible with both 2.7.x and 3.6.x, rather
> > > than using other solutions like 2to3.
> > >
> >
> > I agree with you. I think it makes more sense to make the codebase
> > compatible with both. As you mentioned, Python 3 support is not a
> > short-term goal on the roadmap, and we can discuss it more as we approach
> > that.
> >
> >
> > >
> > >
> > > > Looking forward to hearing your thoughts and comments on “graduating”
> > > > python-sdk to the master.
> > > >
> > > > Thank you,
> > > > Ahmet
> > > >
> > > > (*) Python SDK branch currently has a diverse group of contributors.
> > > > Regular contributors include Charles Chen, Chamikara Jayalath, María
> > > > García Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam
> > > > PMC), Sourabh Bajaj, and Vikas Kedigehalli. We have also had
> > > > contributions from Abdullah Bashir, Marco Buccini, Sergio Fernández,
> > > > Seunghyun Lee, and Younghee Kwon.
> > > >
> > > > [1] https://github.com/apache/beam/tree/python-sdk/sdks/python
> > > > [2] https://beam.apache.org/documentation/programming-guide/
> > > > [3] https://issues.apache.org/jira/browse/BEAM-1265
> > > > [4] https://issues.apache.org/jira/issues/?jql=status%20%3D%20Open%20AND%20labels%20%3D%20sdk-consistency
> > > > [5] https://issues.apache.org/jira/browse/BEAM-1218
> > > > [6] https://pypi.python.org/pypi/google-cloud-dataflow/json
> > > > [7] https://issues.apache.org/jira/browse/BEAM-1251
> > > >
> > >
> > >
> > > Great summary, Ahmet. Thanks.
> > >
> > > Cheers,
> > >
> > > --
> > > Sergio Fernández
> > > Partner Technology Manager
> > > Redlink GmbH
> > > m: +43 6602747925
> > > e: sergio.fernandez@redlink.co
> > > w: http://redlink.co
> > >
> >
>

Re: [DISCUSS] Python SDK status and next steps

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Awesome !

Great work guys !

Regards
JB

On 01/31/2017 08:10 AM, Ahmet Altay wrote:
> Hi all,
>
> This merge is completed. Python SDK is now officially part of the master
> branch! Thank you all for the support. Please open an issue, if you notice
> a reference to the now obsolete python-sdk branch in the documentation.
>
> There will not be any more merges to the python-sdk branch. Going forward
> please use the master branch for Python SDK development. There are a few
> existing open PRs to the python-sdk [1]. If you are the author of one of
> those PRs, please rebase them on top of master.
>
> Thank you,
> Ahmet
>
> [1] https://github.com/pulls?utf8=%E2%9C%93&q=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam+
>
> On Fri, Jan 20, 2017 at 10:06 AM, Kenneth Knowles <kl...@google.com.invalid>
> wrote:
>
>> To clarify the implied criteria of that last exchange, it is "An SDK should
>> have at least one runner that can execute the complete model (may be a
>> direct runner)"
>>
>> I want to highlight this, because whether an _SDK_ supports unbounded data
>> is not particularly well-defined, and will evolve:
>>
>>  - With the Runner API, an SDK will need to support building a graph with
>> unbounded constructs, as today with probably minimal changes.
>>
>>  - With the Fn API, if any part of the Fn API is specific to unbounded
>> data, the SDK will need to implement it. I think right now there is no such
>> thing, and we don't want such a thing, so SDKs implementing the Fn API
>> automatically support unbounded data.
>>
>>  - There will also likely be an SDK-specific shim just as there is today,
>> to leverage idiomatic deserialized representations. The richness of this
>> shim will decrease so that it will need to "support" unbounded data but
>> that will be a ~one liner.
>>
>> Getting the Python SDK on master will accelerate our progress towards the
>> Fn API - partly technical, partly community - which is the best path
>> towards support for unbounded data across multiple runners. I think the
>> criteria are written with the completed portability framework in mind. So
>> this exchange makes me actually more convinced we should merge python-sdk
>> to master.
>>
>> On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw <
>> robertwb@google.com.invalid> wrote:
>>
>>> On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin
>>> <dh...@google.com.invalid> wrote:
>>>> I do not think that Python SDK yet meets the bar [1] for implementing
>>>> the Beam model -- supporting Unbounded data is very important. That said,
>>>> given the committed and sustained set of contributors, it generally makes
>>>> sense to me to make an exception in anticipation of these features being
>>>> fleshed out soon; including potentially new users/contributors that would
>>>> arrive once in master.
>>>>
>>>> [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Yk0PLmM3f5e5BqwJz4+C5doRucLNxOh8w@mail.gmail.com
>>>
>>> That is a valid point. The Python SDK supports all the unbounded parts
>>> of the model except for unbounded sources, which was deferred while
>>> seeing how https://s.apache.org/splittable-do-fn played out. I've been
>>> working with the team and merging/reviewing most of their code, and
>>> have full confidence this will be coming (and on that note can vouch
>>> for a healthy community and support which are much harder to add
>>> later).
>>>
>>> In short, I think it has the required maturity, and I'm in favor of
>>> merging soonish.

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: [DISCUSS] Python SDK status and next steps

Posted by Sergio Fernández <wi...@apache.org>.
great!

On Tue, Jan 31, 2017 at 8:10 AM, Ahmet Altay <al...@google.com.invalid>
wrote:

> Hi all,
>
> This merge is completed. Python SDK is now officially part of the master
> branch! Thank you all for the support. Please open an issue, if you notice
> a reference to the now obsolete python-sdk branch in the documentation.
>
> There will not be any more merges to the python-sdk branch. Going forward
> please use the master branch for Python SDK development. There are a few
> existing open PRs to the python-sdk [1]. If you are the author of one of
> those PRs, please rebase them on top of master.
>
> Thank you,
> Ahmet
>
> [1] https://github.com/pulls?utf8=%E2%9C%93&q=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam+
>
> On Fri, Jan 20, 2017 at 10:06 AM, Kenneth Knowles <kl...@google.com.invalid>
> wrote:
>
> > To clarify the implied criteria of that last exchange, it is "An SDK
> should
> > have at least one runner that can execute the complete model (may be a
> > direct runner)"
> >
> > I want to highlight this, because whether an _SDK_ supports unbounded
> data
> > is not particularly well-defined, and will evolve:
> >
> >  - With the Runner API, an SDK will need to support building a graph with
> > unbounded constructs, as today with probably minimal changes.
> >
> >  - With the Fn API, if any part of the Fn API is specific to unbounded
> > data, the SDK will need to implement it. I think right now there is no
> such
> > thing, and we don't want such a thing, so SDKs implementing the Fn API
> > automatically support unbounded data.
> >
> >  - There will also likely be an SDK-specific shim just as there is today,
> > to leverage idiomatic deserialized representations. The richness of this
> > shim will decrease so that it will need to "support" unbounded data but
> > that will be a ~one liner.
> >
> > Getting the Python SDK on master will accelerate our progress towards the
> > Fn API - partly technical, partly community - which is the best path
> > towards support for unbounded data across multiple runners. I think the
> > criteria are written with the completed portability framework in mind. So
> > this exchange makes me actually more convinced we should merge python-sdk
> > to master.
> >
> > On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw <
> > robertwb@google.com.invalid> wrote:
> >
> > > On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin
> > > <dh...@google.com.invalid> wrote:
> > > > I do not think that Python SDK yet meets the bar [1] for implementing
> > the
> > > > Beam model -- supporting Unbounded data is very important. That said,
> > > given
> > > > the committed and sustained set of contributors, it generally makes
> > sense
> > > > to me to make an exception in anticipation of these features being
> > > fleshed
> > > > out soon; including potentially new users/contributors that would
> > arrive
> > > > once in master.
> > > >
> > > > [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Y
> > > > k0PLmM3f5e5BqwJz4+C5doRucLNxOh8w@mail.gmail.com
> > >
> > > That is a valid point. The Python SDK supports all the unbounded parts
> > > of the model except for unbounded sources, which was deferred while
> > > seeing how https://s.apache.org/splittable-do-fn played out. I've been
> > > working with the team and merging/reviewing most of their code, and
> > > have full confidence this will be coming (and on that note can vouch
> > > for a healthy community and support which are much harder to add
> > > later).
> > >
> > > In short, I think it has the required maturity, and I'm in favor of
> > > merging soonish.
> > >
> > > > On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay
> <altay@google.com.invalid
> > >
> > > > wrote:
> > > >
> > > >> Thank you all for the comments so far. I would follow the process as
> > > >> suggested by Davor and others in this thread.
> > > >>
> > > >> Ahmet
> > > >>
> > > >> On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández <
> wikier@apache.org
> > >
> > > >> wrote:
> > > >>
> > > >> > Hi
> > > >> >
> > > >> > On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay
> > <altay@google.com.invalid
> > > >
> > > >> > wrote:
> > > >> > >
> > > >> > > tl;dr: I would like to start a discussion about merging
> python-sdk
> > > >> branch
> > > >> > > to master branch. Python SDK is mature enough and merging it to
> > > master
> > > >> > will
> > > >> > > accelerate its development and adoption.
> > > >> > >
> > > >> >
> > > >> > Good point, Ahmet!
> > > >> >
> > > >> > I've been following the development closely since it was imported in June.
> > > >> > For the prototypes I've implemented so far it works quite well; I guess we'd
> > > >> > just need to focus the next months on adding support for more runners.
> > > >> >
> > > >> > > With a great effort from a lot of contributors(*), Python SDK [1] is now a
> > > >> > > mostly complete, tested, performant Python implementation of the Beam
> > > >> > > model. Since June, when we first started with Python SDK in Apache Beam we
> > > >> > > have been continuously improving it.
> > > >> > >
> > > >> >
> > > >> > I wouldn't merge during the preparation of the 0.5.0 release, but after that
> > > >> > could be a good time to merge back into master.
> > > >> >
> > > >> >
> > > >> > > ** Python SDK currently supports:
> > > >> > >
> > > >> > > * Model: All main concepts are present (ParDo, GroupByKey, Windowing etc.).
> > > >> > > * IO: There are extensible APIs for writing new bounded sources and sinks.
> > > >> > > Implementations are provided for Text, Avro, BigQuery, and Datastore.
> > > >> > > * Runners: Python SDK has an extensible base runner module that allows
> > > >> > > building specific runners on top of it. The SDK comes with two pipeline
> > > >> > > runners: DirectRunner and DataflowRunner; and it is possible to add more.
> > > >> > > The existing runners are currently limited to bounded execution and
> > > >> > > otherwise equivalent to their Java SDK counterparts in functionality.
> > > >> > >
> > > >> >
> > > >> > What would be the effort of porting, and maintaining, parallel versions of
> > > >> > the Java runners? I guess I'd need to dig deeper into the model, but this may
> > > >> > represent a major effort for the project, right?
> > > >> >
> > > >>
> > > >> It is somewhat higher for DirectRunner because DirectRunner also implements
> > > >> the code for execution. It is not that high for DataflowRunner because the
> > > >> base runner module has a lot of helpers with the right hooks for
> > > >> implementing a generic runner. I would _expect_ the experience in general
> > > >> would be similar to the latter.
> > > >>
> > > >>
> > > >> >
> > > >> >
> > > >> >
> > > >> > > * Testing: Python SDK implements ValidatesRunner test framework for
> > > >> > > implementing integration tests for current and future runners. There is unit
> > > >> > > test coverage for all modules, and a number of integration tests for
> > > >> > > validating existing runners.
> > > >> > > * Documentation and examples: Documentation work has started on Python SDK.
> > > >> > > Beam Programming Guide page has been updated to include Python [2]. The
> > > >> > > code comes with many ready to use examples and we are in a good place to
> > > >> > > start documenting those on the website.
> > > >> > >
> > > >> > > ** We are not done yet, next on the roadmap we have:
> > > >> > >
> > > >> > > * Streaming: Both of the existing runners lack support for streaming
> > > >> > > execution, and currently there is work going on for adding streaming
> > > >> > > support to DirectRunner [3].
> > > >> > > * Documentation: Filling the rest of the Beam documentation with Python
> > > >> > > SDK specific information and examples.
> > > >> > > * SDK consistency: Making Python SDK consistent with the Java SDK. We have
> > > >> > > come a long way on this and have only a few items left [4].
> > > >> > > * Beamifying: We have been working on removing Dataflow-specific references
> > > >> > > both from the documentation and from the code. There is some work left, and
> > > >> > > we are currently working on those as well [5].
> > > >> > >
> > > >> > > ** Steps and implications of merging to master:
> > > >> > >
> > > >> > > * Master branch is merged to python-sdk branch at regular intervals and the
> > > >> > > last merge was on 12/22. All the past merges were uneventful because there
> > > >> > > is a minimal overlap in modified files between branches. Integrating
> > > >> > > python-sdk to master will similarly touch a small number of existing files.
> > > >> > >
> > > >> > > * Python SDK is using the same tools for building and testing. It is
> > > >> > > already integrated with Maven, Jenkins and Travis. Specifically the impact
> > > >> > > to the testing infrastructure would be:
> > > >> > > - There will be two additional test configurations in Travis. Since Travis
> > > >> > > runs all configurations in parallel there should not be a noticeable change
> > > >> > > in the Travis run time.
> > > >> > > - Jenkins pre-commit test will start running the Python SDK tests. It will
> > > >> > > add an additional 5 minutes to the completion time of pre-commit test.
> > > >> > > Historically Python SDK tests were not flaky and did not cause any random
> > > >> > > failures.
> > > >> > > - Jenkins Python post-commit test is already separated from the other
> > > >> > > post-commit tests and will continue to exist. It would not change the
> > > >> > > testing time for any other test.
> > > >> > >
> > > >> > > * The release process needs to be updated to accommodate releasing Python
> > > >> > > artifacts. Python SDK would fit in the existing release schedule and could
> > > >> > > be released along with the Java SDK. The additional steps would include:
> > > >> > > - Generating Python artifacts. This could be done with a single command
> > > >> > > using Maven today.
> > > >> > > - Publishing the artifacts to a central repository such as PyPI.
> > > >> >
> > > >> > I'm more than happy to help on this. We left some things open on purpose
> > > >> > when we added Maven support to the Python build.
> > > >> >
> > > >>
> > > >> That would be awesome. We can coordinate on that post-merge.
> > > >>
> > > >>
> > > >> >
> > > >> >
> > > >> >
> > > >> > > - Updating the release guide to reflect the changes above.
> > > >> > >
> > > >> > > * Users: There are existing users using the Python SDK. To give a rough
> > > >> > > estimate, a distribution of the Beam Python SDK had a total of 23K
> > > >> > > downloads in the past 6 months [6]. Some of those users are already engaged
> > > >> > > with the community (e.g. [7]). There might be an increased amount of
> > > >> > > engagement from the rest of them after the merge.
> > > >> > >
> > > >> >
> > > >> > Python 3 support is something we definitely need to look ahead to. I'd try
> > > >> > to make the codebase compatible with both 2.7.x and 3.6.x, rather than
> > > >> > using other solutions like 2to3.
> > > >> >
> > > >>
> > > >> I agree with you. I think it makes more sense to make the codebase compatible
> > > >> with both. As you mentioned, Python 3 support is not a short-term goal in
> > > >> the roadmap, and we can discuss it more as we approach that.
> > > >>
> > > >>
> > > >> >
> > > >> >
> > > >> > > Looking forward to hearing your thoughts and comments on “graduating”
> > > >> > > python-sdk to the master.
> > > >> > >
> > > >> > > Thank you,
> > > >> > > Ahmet
> > > >> > >
> > > >> > > (*) Python SDK branch currently has a diverse group of contributors.
> > > >> > > Regular contributors include Charles Chen, Chamikara Jayalath, María García
> > > >> > > Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam PMC),
> > > >> > > Sourabh Bajaj, and Vikas Kedigehalli. We have also had contributions from
> > > >> > > Abdullah Bashir, Marco Buccini, Sergio Fernández, Seunghyun Lee, and
> > > >> > > Younghee Kwon.
> > > >> > >
> > > >> > > [1] https://github.com/apache/beam/tree/python-sdk/sdks/python
> > > >> > > [2] https://beam.apache.org/documentation/programming-guide/
> > > >> > > [3] https://issues.apache.org/jira/browse/BEAM-1265
> > > >> > > [4] https://issues.apache.org/jira/issues/?jql=status%20%3D%20Open%20AND%20labels%20%3D%20sdk-consistency
> > > >> > > [5] https://issues.apache.org/jira/browse/BEAM-1218
> > > >> > > [6] https://pypi.python.org/pypi/google-cloud-dataflow/json
> > > >> > > [7] https://issues.apache.org/jira/browse/BEAM-1251
> > > >> > >
> > > >> >
> > > >> >
> > > >> > Great summary, Ahmet. Thanks.
> > > >> >
> > > >> > Cheers,
> > > >> >
> > > >> > --
> > > >> > Sergio Fernández
> > > >> > Partner Technology Manager
> > > >> > Redlink GmbH
> > > >> > m: +43 6602747925
> > > >> > e: sergio.fernandez@redlink.co
> > > >> > w: http://redlink.co
> > > >> >
> > > >>
> > >
> >
>



-- 
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 6602747925
e: sergio.fernandez@redlink.co
w: http://redlink.co

Re: [DISCUSS] Python SDK status and next steps

Posted by "Prabeesh K." <pr...@gmail.com>.
https://issues.apache.org/jira/browse/BEAM-1360

On 31 January 2017 at 12:12, Prabeesh K. <pr...@gmail.com> wrote:

> https://issues.apache.org/jira/browse/BAHIR-86
>
> On 31 January 2017 at 11:10, Ahmet Altay <al...@google.com.invalid> wrote:
>
>> Hi all,
>>
>> This merge is completed. Python SDK is now officially part of the master
>> branch! Thank you all for the support. Please open an issue, if you notice
>> a reference to the now obsolete python-sdk branch in the documentation.
>>
>> There will not be any more merges to the python-sdk branch. Going forward
>> please use the master branch for Python SDK development. There are a few
>> existing open PRs to the python-sdk [1]. If you are the author of one of
>> those PRs, please rebase them on top of master.
>>
>> Thank you,
>> Ahmet
>>
>> [1] https://github.com/pulls?utf8=✓&q=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam+
>>

Re: [DISCUSS] Python SDK status and next steps

Posted by "Prabeesh K." <pr...@gmail.com>.
https://issues.apache.org/jira/browse/BAHIR-86

On 31 January 2017 at 11:10, Ahmet Altay <al...@google.com.invalid> wrote:

> Hi all,
>
> This merge is completed. Python SDK is now officially part of the master
> branch! Thank you all for the support. Please open an issue, if you notice
> a reference to the now obsolete python-sdk branch in the documentation.
>
> There will not be any more merges to the python-sdk branch. Going forward
> please use the master branch for Python SDK development. There are a few
> existing open PRs to the python-sdk [1]. If you are the author of one of
> those PRs, please rebase them on top of master.
>
> Thank you,
> Ahmet
>
> [1] https://github.com/pulls?utf8=✓&q=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam+
>
> On Fri, Jan 20, 2017 at 10:06 AM, Kenneth Knowles <kl...@google.com.invalid>
> wrote:
>
> > To clarify the implied criteria of that last exchange, it is "An SDK should
> > have at least one runner that can execute the complete model (may be a
> > direct runner)"
> >
> > I want to highlight this, because whether an _SDK_ supports unbounded data
> > is not particularly well-defined, and will evolve:
> >
> >  - With the Runner API, an SDK will need to support building a graph with
> > unbounded constructs, as today with probably minimal changes.
> >
> >  - With the Fn API, if any part of the Fn API is specific to unbounded
> > data, the SDK will need to implement it. I think right now there is no such
> > thing, and we don't want such a thing, so SDKs implementing the Fn API
> > automatically support unbounded data.
> >
> >  - There will also likely be an SDK-specific shim just as there is today,
> > to leverage idiomatic deserialized representations. The richness of this
> > shim will decrease so that it will need to "support" unbounded data but
> > that will be a ~one liner.
> >
> > Getting the Python SDK on master will accelerate our progress towards the
> > Fn API - partly technical, partly community - which is the best path
> > towards support for unbounded data across multiple runners. I think the
> > criteria are written with the completed portability framework in mind. So
> > this exchange makes me actually more convinced we should merge python-sdk
> > to master.
> >
> > On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw <
> > robertwb@google.com.invalid> wrote:
> >
> > > On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin
> > > <dh...@google.com.invalid> wrote:
> > > > I do not think that Python SDK yet meets the bar [1] for implementing the
> > > > Beam model -- supporting Unbounded data is very important. That said, given
> > > > the committed and sustained set of contributors, it generally makes sense
> > > > to me to make an exception in anticipation of these features being fleshed
> > > > out soon; including potentially new users/contributors that would arrive
> > > > once in master.
> > > >
> > > > [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Yk0PLmM3f5e5BqwJz4+C5doRucLNxOh8w@mail.gmail.com
> > >
> > > That is a valid point. The Python SDK supports all the unbounded parts
> > > of the model except for unbounded sources, which was deferred while
> > > seeing how https://s.apache.org/splittable-do-fn played out. I've been
> > > working with the team and merging/reviewing most of their code, and
> > > have full confidence this will be coming (and on that note can vouch
> > > for a healthy community and support which are much harder to add
> > > later).
> > >
> > > In short, I think it has the required maturity, and I'm in favor of
> > > merging soonish.
> > >
> > > > On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay <altay@google.com.invalid> wrote:
> > > >
> > > >> Thank you all for the comments so far. I would follow the process as
> > > >> suggested by Davor and others in this thread.
> > > >>
> > > >> Ahmet
> > > >>
> > > >> On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández <wikier@apache.org> wrote:
> > > >>
> > > >> > Hi
> > > >> >
> > > >> > On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay <altay@google.com.invalid> wrote:
> > > >> > >
> > > >> > > tl;dr: I would like to start a discussion about merging python-sdk branch
> > > >> > > to master branch. Python SDK is mature enough and merging it to master will
> > > >> > > accelerate its development and adoption.
> > > >> > >
> > > >> >
> > > >> > Good point, Ahmet!
> > > >> >
> > > >> > I've been following the development closely since it was imported in June.
> > > >> > For the prototypes I've implemented so far it works quite well; I guess we'd
> > > >> > just need to focus the next months on adding support for more runners.
> > > >> >
> > > >> > > With a great effort from a lot of contributors(*), Python SDK [1] is now a
> > > >> > > mostly complete, tested, performant Python implementation of the Beam
> > > >> > > model. Since June, when we first started with Python SDK in Apache Beam we
> > > >> > > have been continuously improving it.
> > > >> > >
> > > >> >
> > > >> > I wouldn't merge during the preparation of the 0.5.0 release, but after that
> > > >> > could be a good time to merge back into master.
> > > >> >
> > > >> >
> > > >> > > ** Python SDK currently supports:
> > > >> > >
> > > >> > > * Model: All main concepts are present (ParDo, GroupByKey, Windowing etc.).
> > > >> > > * IO: There are extensible APIs for writing new bounded sources and sinks.
> > > >> > > Implementations are provided for Text, Avro, BigQuery, and Datastore.
> > > >> > > * Runners: Python SDK has an extensible base runner module that allows
> > > >> > > building specific runners on top of it. The SDK comes with two pipeline
> > > >> > > runners: DirectRunner and DataflowRunner; and it is possible to add more.
> > > >> > > The existing runners are currently limited to bounded execution and
> > > >> > > otherwise equivalent to their Java SDK counterparts in functionality.
> > > >> > >
> > > >> >
> > > >> > What would be the effort of porting, and maintaining, parallel versions of
> > > >> > the Java runners? I guess I'd need to dig deeper into the model, but this may
> > > >> > represent a major effort for the project, right?
> > > >> >
> > > >>
> > > >> It is somewhat higher for DirectRunner because DirectRunner also implements
> > > >> the code for execution. It is not that high for DataflowRunner because the
> > > >> base runner module has a lot of helpers with the right hooks for
> > > >> implementing a generic runner. I would _expect_ the experience in general
> > > >> would be similar to the latter.
> > > >>
> > > >>
> > > >> >
> > > >> >
> > > >> >
> > > >> > > * Testing: Python SDK implements ValidatesRunner test framework for
> > > >> > > implementing integration tests for current and future runners. There is unit
> > > >> > > test coverage for all modules, and a number of integration tests for
> > > >> > > validating existing runners.
> > > >> > > * Documentation and examples: Documentation work has started on Python SDK.
> > > >> > > Beam Programming Guide page has been updated to include Python [2]. The
> > > >> > > code comes with many ready to use examples and we are in a good place to
> > > >> > > start documenting those on the website.
> > > >> > >
> > > >> > > ** We are not done yet, next on the roadmap we have:
> > > >> > >
> > > >> > > * Streaming: Both of the existing runners lack support for streaming
> > > >> > > execution, and currently there is work going on for adding streaming
> > > >> > > support to DirectRunner [3].
> > > >> > > * Documentation: Filling the rest of the Beam documentation with Python
> > > >> > > SDK specific information and examples.
> > > >> > > * SDK consistency: Making Python SDK consistent with the Java SDK. We have
> > > >> > > come a long way on this and have only a few items left [4].
> > > >> > > * Beamifying: We have been working on removing Dataflow-specific references
> > > >> > > both from the documentation and from the code. There is some work left, and
> > > >> > > we are currently working on those as well [5].
> > > >> > >
> > > >> > > ** Steps and implications of merging to master:
> > > >> > >
> > > >> > > * Master branch is merged to python-sdk branch at regular
> > > >> > > intervals and the last merge was on 12/22. All the past merges
> > > >> > > were uneventful because there is a minimal overlap in modified
> > > >> > > files between branches. Integrating python-sdk to master will
> > > >> > > similarly touch a small number of existing files.
> > > >> > >
> > > >> > > * Python SDK is using the same tools for building and testing. It
> > > >> > > is already integrated with Maven, Jenkins and Travis. Specifically
> > > >> > > the impact to the testing infrastructure would be:
> > > >> > > - There will be two additional test configurations in Travis.
> > > >> > > Since Travis runs all configurations in parallel there should not
> > > >> > > be a noticeable change in the Travis run time.
> > > >> > > - Jenkins pre-commit test will start running the Python SDK tests.
> > > >> > > It will add an additional 5 minutes to the completion time of
> > > >> > > pre-commit test. Historically Python SDK tests were not flaky and
> > > >> > > did not cause any random failures.
> > > >> > > - Jenkins Python post-commit test is already separated from the
> > > >> > > other post-commit tests and will continue to exist. It would not
> > > >> > > change the testing time for any other test.
> > > >> > >
> > > >> > > * The release process needs to be updated to accommodate releasing
> > > >> > > Python artifacts. Python SDK would fit in the existing release
> > > >> > > schedule and could be released along with the Java SDK. The
> > > >> > > additional steps would include:
> > > >> > > - Generating Python artifacts. This could be done with a single
> > > >> > > command using Maven today.
> > > >> > > - Publishing the artifacts to a central repository such as PyPI.
> > > >> > >
> > > >> >
> > > >> > I'm more than happy to help on this. We left on purpose some things
> > > >> > open when we added Maven support to the Python build.
> > > >> >
> > > >>
> > > >> That would be awesome. We can coordinate on that post-merge.
> > > >>
> > > >>
> > > >> >
> > > >> >
> > > >> >
> > > >> > > - Updating the release guide to reflect the changes above.
> > > >> > >
> > > >> > > * Users: There are existing users using the Python SDK. To give a
> > > >> > > rough estimate, a distribution of the Beam Python SDK had a total
> > > >> > > of 23K downloads in the past 6 months [6]. Some of those users are
> > > >> > > already engaged with the community (e.g. [7]). There might be an
> > > >> > > increased amount engagement from the rest of them after the merge.
> > > >> > >
> > > >> >
> > > >> > Python 3 support is something we definitively need to look ahead.
> > > >> > I'd try to make the codebase compatible with both 2.7.x and 3.6.x,
> > > >> > rather than using other solutions like 2to3.
> > > >> >
> > > >>
> > > >> I agree with you. I think it makes more sense to make codebase
> > > >> compatible with both. As you mentioned Python 3 support is not a
> > > >> short-term goal in the roadmap, and we can discuss it more as we
> > > >> approach that.
> > > >>
> > > >>
> > > >> >
> > > >> >
> > > >> > > Looking forward to hearing your thoughts and comments on
> > > >> > > “graduating” python-sdk to the master.
> > > >> > >
> > > >> > > Thank you,
> > > >> > > Ahmet
> > > >> > >
> > > >> > > (*) Python SDK branch currently has a diverse group of
> > > >> > > contributors. Regular contributors include Charles Chen, Chamikara
> > > >> > > Jayalath, María García Herrero, Mark Liu, Pablo Estrada, Robert
> > > >> > > Bradshaw (Apache Beam PMC), Sourabh Bajaj, and Vikas Kedigehalli.
> > > >> > > We have also had contributions from Abdullah Bashir, Marco
> > > >> > > Buccini, Sergio Fernández, Seunghyun Lee, and Younghee Kwon.
> > > >> > >
> > > >> > > [1] https://github.com/apache/beam/tree/python-sdk/sdks/python
> > > >> > > [2] https://beam.apache.org/documentation/programming-guide/
> > > >> > > [3] https://issues.apache.org/jira/browse/BEAM-1265
> > > >> > > [4]
> > > >> > > https://issues.apache.org/jira/issues/?jql=status%20%3D%20Open%20AND%20labels%20%3D%20sdk-consistency
> > > >> > > [5] https://issues.apache.org/jira/browse/BEAM-1218
> > > >> > > [6] https://pypi.python.org/pypi/google-cloud-dataflow/json
> > > >> > > [7] https://issues.apache.org/jira/browse/BEAM-1251
> > > >> > >
> > > >> >
> > > >> >
> > > >> > Great summary, Ahmet. Thanks.
> > > >> >
> > > >> > Cheers,
> > > >> >
> > > >> > --
> > > >> > Sergio Fernández
> > > >> > Partner Technology Manager
> > > >> > Redlink GmbH
> > > >> > m: +43 6602747925
> > > >> > e: sergio.fernandez@redlink.co
> > > >> > w: http://redlink.co
> > > >> >
> > > >>
> > >
> >
>

Re: [DISCUSS] Python SDK status and next steps

Posted by Kenneth Knowles <kl...@google.com.INVALID>.
Awesome!

On Tue, Jan 31, 2017 at 9:38 AM, Ahmet Altay <al...@google.com.invalid>
wrote:

> Thank you Prabeesh and Sergio for fixing those!

Re: [DISCUSS] Python SDK status and next steps

Posted by Ahmet Altay <al...@google.com.INVALID>.
Thank you Prabeesh and Sergio for fixing those!

On Tue, Jan 31, 2017 at 4:51 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Awesome, thanks Sergio ! Much appreciated ;)
>
> Regards
> JB
>
>
> On 01/31/2017 01:42 PM, Sergio Fernández wrote:
>
>> PR #1879 provides the basics: https://github.com/apache/beam/pull/1879
>>
>> On Tue, Jan 31, 2017 at 1:33 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
>> wrote:
>>
>>> No, that's fine as soon as we clearly document the prerequisite for the
>>> build. IMHO, we should provide quick BUILDING instructions in the
>>> README.md.
>>>
>>> Regards
>>> JB
>>>
>>>
>>> On 01/31/2017 01:24 PM, Sergio Fernández wrote:
>>>
>>>> Originally we integrate the build in Maven with the default profile.
>>>> Do you feel like it'd be better to have it under a separated profile or
>>>> so?
>>>>
>>>> On Tue, Jan 31, 2017 at 11:07 AM, Jean-Baptiste Onofré <jb@nanthrax.net>
>>>> wrote:
>>>>
>>>>> Just to be clear, the prerequisite to be able to build the Python SDK
>>>>> are:
>>>>>
>>>>> apt-get install python-setuptools
>>>>> apt-get install python-pip
>>>>>
>>>>> It's also required by the default "regular" build.
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>>
>>>>> On 01/31/2017 11:02 AM, Jean-Baptiste Onofré wrote:
>>>>>
>>>>> Just one thing I noticed (and can be helpful for others): to build Beam
>>>>>
>>>>>> we now need python setuptools installed.
>>>>>>
>>>>>> For instance, on Ubuntu, you have to do:
>>>>>>
>>>>>> apt-get install python-setuptools
>>>>>>
>>>>>> Same for the pip distribution.
>>>>>>
>>>>>> I guess (if not already done), we have to update README/Building
>>>>>> instructions.
>>>>>>
>>>>>> Correct ?
>>>>>>
>>>>>> Regards
>>>>>> JB
>>>>>>
>>>>>> On 01/31/2017 08:10 AM, Ahmet Altay wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>>>
>>>>>>> This merge is completed. Python SDK is now officially part of the
>>>>>>> master
>>>>>>> branch! Thank you all for the support. Please open an issue, if you
>>>>>>> notice
>>>>>>> a reference to the now obsolete python-sdk branch in the
>>>>>>> documentation.
>>>>>>>
>>>>>>> There will not be any more merges to the python-sdk branch. Going
>>>>>>> forward
>>>>>>> please use the master branch for Python SDK development. There are a
>>>>>>> few
>>>>>>> existing open PRs to the python-sdk [1]. If you are the author of one
>>>>>>> of
>>>>>>> those PRs, please rebase them on top of master.
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Ahmet
>>>>>>>
>>>>>>> [1] https://github.com/pulls?utf8=✓&q=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam+
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jan 20, 2017 at 10:06 AM, Kenneth Knowles
>>>>>>> <kl...@google.com.invalid>
>>>>>>> wrote:
>>>>>>>
>>>>>>> To clarify the implied criteria of that last exchange, it is "An SDK
>>>>>>>
>>>>>>> should
>>>>>>>> have at least one runner that can execute the complete model (may
>>>>>>>> be a
>>>>>>>> direct runner)"
>>>>>>>>
>>>>>>>> I want to highlight this, because whether an _SDK_ supports
>>>>>>>> unbounded
>>>>>>>> data
>>>>>>>> is not particularly well-defined, and will evolve:
>>>>>>>>
>>>>>>>>  - With the Runner API, an SDK will need to support building a graph
>>>>>>>> with
>>>>>>>> unbounded constructs, as today with probably minimal changes.
>>>>>>>>
>>>>>>>>  - With the Fn API, if any part of the Fn API is specific to
>>>>>>>> unbounded
>>>>>>>> data, the SDK will need to implement it. I think right now there is
>>>>>>>> no such
>>>>>>>> thing, and we don't want such a thing, so SDKs implementing the Fn
>>>>>>>> API
>>>>>>>> automatically support unbounded data.
>>>>>>>>
>>>>>>>>  - There will also likely be an SDK-specific shim just as there is
>>>>>>>> today,
>>>>>>>> to leverage idiomatic deserialized representations. The richness of
>>>>>>>> this
>>>>>>>> shim will decrease so that it will need to "support" unbounded data
>>>>>>>> but
>>>>>>>> that will be a ~one liner.
>>>>>>>>
>>>>>>>> Getting the Python SDK on master will accelerate our progress
>>>>>>>> towards
>>>>>>>> the
>>>>>>>> Fn API - partly technical, partly community - which is the best path
>>>>>>>> towards support for unbounded data across multiple runners. I think
>>>>>>>> the
>>>>>>>> criteria are written with the completed portability framework in
>>>>>>>> mind. So
>>>>>>>> this exchange makes me actually more convinced we should merge
>>>>>>>> python-sdk
>>>>>>>> to master.
>>>>>>>>
>>>>>>>> On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw <
>>>>>>>> robertwb@google.com.invalid> wrote:
>>>>>>>>
>>>>>>>> On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin
>>>>>>>>
>>>>>>>> <dh...@google.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>> I do not think that Python SDK yet meets the bar [1] for
>>>>>>>>> implementing
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> the
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>> Beam model -- supporting Unbounded data is very important. That
>>>>>>>> said,
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> given
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> the committed and sustained set of contributors, it generally makes
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> sense
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>> to me to make an exception in anticipation of these features being
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> fleshed
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> out soon; including potentially new users/contributors that would
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> arrive
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>> once in master.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Y
>>>>>>>>>> k0PLmM3f5e5BqwJz4+C5doRucLNxOh8w@mail.gmail.com
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> That is a valid point. The Python SDK supports all the unbounded
>>>>>>>>> parts
>>>>>>>>> of the model except for unbounded sources, which was deferred while
>>>>>>>>> seeing how https://s.apache.org/splittable-do-fn played out. I've
>>>>>>>>> been
>>>>>>>>> working with the team and merging/reviewing most of their code, and
>>>>>>>>> have full confidence this will be coming (and on that note can
>>>>>>>>> vouch
>>>>>>>>> for a healthy community and support which are much harder to add
>>>>>>>>> later).
>>>>>>>>>
>>>>>>>>> In short, I think it has the required maturity, and I'm in favor of
>>>>>>>>> merging soonish.
>>>>>>>>>
>>>>>>>>> On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay
>>>>>>>>>
>>>>>>>>> <altay@google.com.invalid
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Thank you all for the comments so far. I would follow the process
>>>>>>>>>> as
>>>>>>>>>>
>>>>>>>>>> suggested by Davor and others in this thread.
>>>>>>>>>>>
>>>>>>>>>>> Ahmet
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández <
>>>>>>>>>>> wikier@apache.org
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Hi
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay
>>>>>>>>>>>>
>>>>>>>>>>>> <altay@google.com.invalid
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> tl;dr: I would like to start a discussion about merging
>>>>>>>>>>>>> python-sdk
>>>>>>>>>>>>>
>>>>>>>>>>>>> branch
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> to master branch. Python SDK is mature enough and merging it to
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> master
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> will
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> accelerate its development and adoption.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Good point, Ahmet!
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I've been closely following the development since it was imported in
>>>>>>>>>>>> June.
>>>>>>>>>>>>
>>>>>>>>>>>> For
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> the prototypes I've implemented so far it works quite well; I
>>>>>>>>> guess
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> we'd
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> just need to focus the next months in bringing more runners
>>>>>>>>> support.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> With a great effort from a lot of contributors(*), Python SDK [1]
>>>>>>>>>>>> is
>>>>>>>>>>>>
>>>>>>>>>>>> now
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> a
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> mostly complete, tested, performant Python implementation of the
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Beam
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> model. Since June, when we first started with Python SDK in
>>>>>>>>> Apache
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Beam
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> we
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> have been continuously improving it.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I wouldn't merge during the preparation of 0.5.0 release, but
>>>>>>>>>>>>>
>>>>>>>>>>>> after
>>>>>>>>>>>>
>>>>>>>>>>>> that
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> could be a good time to merge back into master.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> ** Python SDK currently supports:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> * Model: All main concepts are present (ParDo, GroupByKey,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Windowing
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> etc.).
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> * IO: There are extensible APIs for writing new bounded sources
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> and
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> sinks.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Implementations are provided for Text, Avro, BigQuery, and
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Datastore.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> * Runners: Python SDK has an extensible base runner module that
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> allows
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> building specific runners on top of it. The SDK comes with two
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> pipeline
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> runners: DirectRunner and DataflowRunner; and it is possible to
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> add
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> more.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> The existing runners are currently limited to bounded execution
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> and
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> otherwise equivalent to their Java SDK counterparts in
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> functionality.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> What would the effort of porting, and maintaining, parallel
>>>>>>>>>>>>>
>>>>>>>>>>>> versions
>>>>>>>>>>>>
>>>>>>>>>>>> of
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> the
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Java runners? I guess I'd need to dig deeper in the model, but
>>>>>>>>>>> this
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> may
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> represent a major effort for the project, right?
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> It is somewhat higher for DirectRunner because DirectRunner also
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> implements
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> the code for execution. It is not that high for DataflowRunner
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> because
>>>>>>>>>>>
>>>>>>>>>>> the
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> base runner module has a lot of helpers with the right hooks for
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> implementing a generic runner. I would _expect_ the experience in
>>>>>>>>>>>
>>>>>>>>>>> general
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> would be similar to the latter.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> * Testing: Python SDK implements ValidatesRunner test framework
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> for
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> implementing integration test for current and future runners.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> There
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> is
>>>>>>>>
>>>>>>>>>
>>>>>>>>> unit
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> test coverage for all modules, and a number of integration tests
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> for
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> validating existing runners.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> * Documentation and examples: Documentation work has started on
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>> Python
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> SDK.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Beam Programming Guide page has been updated to include Python
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> [2].
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> The
>>>>>>>>
>>>>>>>>>
>>>>>>>>> code comes with many ready to use examples and we are in a good
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> place
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> to
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> start documenting those on the website.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> ** We are not done yet, next on the roadmap we have:
>>>>>>>>>>>>>
>>>>>>>>>>>>> * Streaming: Both of the existing runners lack support for
>>>>>>>>>>>>>
>>>>>>>>>>>>> streaming
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> execution, and currently there is work going on for adding
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> streaming
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> support to DirectRunner [3].
>>>>>>>>
>>>>>>>>>
>>>>>>>>> * Documentation: Filling the rest of the Beam documentations with
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>> Python
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> SDK specific information and examples.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> * SDK consistency: Making Python SDK consistent with the Java
>>>>>>>>>>>>> SDK.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> have
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> come a long way on this and have only a few items left [4].
>>>>>>>>>>>>
>>>>>>>>>>>>> * Beamifying: We have been working on removing
>>>>>>>>>>>>> Dataflow-specific
>>>>>>>>>>>>>
>>>>>>>>>>>>> references
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> both from the documentation and from the code. There is some
>>>>>>>>>>>> work
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> left,
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> and
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> we are currently working on those as well [5].
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> ** Steps and implications of merging to master:
>>>>>>>>>>>>>
>>>>>>>>>>>>> * Master branch is merged to python-sdk branch at regular
>>>>>>>>>>>>>
>>>>>>>>>>>>> intervals
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> and
>>>>>>>>
>>>>>>>>>
>>>>>>>>> the
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> last merge was on 12/22. All the past merges were uneventful
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> because
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> there
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> is a minimal overlap in modified files between branches.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Integrating
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> python-sdk to master will similarly touch a small number of
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> existing
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> files.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>> * Python SDK is using the same tools for building and testing.
>>>>>>>>>>>>> It
>>>>>>>>>>>>>
>>>>>>>>>>>>> is
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> already integrated with Maven, Jenkins and Travis. Specifically
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> the
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> impact
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> to the testing infrastructure would be:
>>>>>>>>>>>>
>>>>>>>>>>>>> - There will be two additional test configurations in Travis.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Since
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> Travis
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> runs all configurations in parallel there should not be a
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> noticeable
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> change
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> in the Travis run time.
>>>>>>>>>>>>
>>>>>>>>>>>>> - Jenkins pre-commit test will start running the Python SDK
>>>>>>>>>>>>> tests.
>>>>>>>>>>>>>
>>>>>>>>>>>>> It
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> will
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> add an additional 5 minutes to the completion time of pre-commit
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> test.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> Historically Python SDK tests were not flaky and did not cause
>>>>>>>>> any
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> random
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> failures.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> - Jenkins Python post-commit test is already separated from the
>>>>>>>>>>>>>
>>>>>>>>>>>>> other
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> post-commit tests and will continue to exist. It would not change
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> the
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> testing time for any other test.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> * The release process needs to be updated to accommodate
>>>>>>>>>>>>> releasing
>>>>>>>>>>>>>
>>>>>>>>>>>>> Python
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> artifacts. Python SDK would fit in the existing release schedule
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> and
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> could
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> be released along with the Java SDK. The additional steps would
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> include:
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> - Generating Python artifacts. This could be done with a single
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> command
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> using Maven today.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> - Publishing the artifacts to a central repository such as PyPI.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm more than happy to help on this. We left on purpose some
>>>>>>>>>>>>>
>>>>>>>>>>>> things
>>>>>>>>>>>>
>>>>>>>>>>>> open
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> when we added Maven support to the Python build.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> That would be awesome. We can coordinate on that post-merge.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> - Updating the release guide to reflect the changes above.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> * Users: There are existing users using the Python SDK. To
>>>>>>>>>>>>> give a
>>>>>>>>>>>>>
>>>>>>>>>>>>> rough
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> estimate, a distribution of the Beam Python SDK had a total of
>>>>>>>>> 23K
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> downloads in the past 6 months [6]. Some of those users are
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> already
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> engaged
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> with the community (e.g. [7]). There might be an increased amount
>>>>>>>>>>>>
>>>>>>>>>>>>> engagement from the rest of them after the merge.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Python 3 support is something we definitely need to look
>>>>>>>>>>>>> ahead to.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I'd
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> try
>>>>>>>>
>>>>>>>>>
>>>>>>>>> to make the codebase compatible with both 2.7.x and 3.6.x, rather
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> than
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> using other solutions like 2to3.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>> I agree with you. I think it makes more sense to make codebase
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> compatible
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> with both. As you mentioned Python 3 support is not a short-term
>>>>>>>>> goal
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> in
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> the roadmap, and we can discuss it more as we approach that.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Looking forward to hearing your thoughts and comments on
>>>>>>>>>>>>
>>>>>>>>>>>> “graduating”
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> python-sdk to the master.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Thank you,
>>>>>>>>>>>>> Ahmet
>>>>>>>>>>>>>
>>>>>>>>>>>>> (*) Python SDK branch currently has a diverse group of
>>>>>>>>>>>>>
>>>>>>>>>>>>> contributors.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> Regular contributors include Charles Chen, Chamikara Jayalath,
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> María
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> García
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> PMC),
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> Sourabh Bajaj, and Vikas Kedigehalli. We have also had
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> contributions
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> from
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Abdullah Bashir, Marco Buccini, Sergio Fernández, Seunghyun Lee,
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> and
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> Younghee Kwon.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> [1] https://github.com/apache/beam/tree/python-sdk/sdks/python
>>>>>>>>>>>>> [2] https://beam.apache.org/documentation/programming-guide/
>>>>>>>>>>>>> [3] https://issues.apache.org/jira/browse/BEAM-1265
>>>>>>>>>>>>> [4]
>>>>>>>>>>>>> https://issues.apache.org/jira/issues/?jql=status%20%3D%20Op
>>>>>>>>>>>>> en%20AND%20labels%20%3D%20sdk-consistency
>>>>>>>>>>>>> [5] https://issues.apache.org/jira/browse/BEAM-1218
>>>>>>>>>>>>> [6] https://pypi.python.org/pypi/google-cloud-dataflow/json
>>>>>>>>>>>>> [7] https://issues.apache.org/jira/browse/BEAM-1251
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Great summary, Ahmet. Thanks.
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Sergio Fernández
>>>>>>>>>>>> Partner Technology Manager
>>>>>>>>>>>> Redlink GmbH
>>>>>>>>>>>> m: +43 6602747925
>>>>>>>>>>>> e: sergio.fernandez@redlink.co
>>>>>>>>>>>> w: http://redlink.co
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>> --
>>>>>>
>>>>> Jean-Baptiste Onofré
>>>>> jbonofre@apache.org
>>>>> http://blog.nanthrax.net
>>>>> Talend - http://www.talend.com
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>> Jean-Baptiste Onofré
>>> jbonofre@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>>
>>
>>
>>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>

Re: [DISCUSS] Python SDK status and next steps

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Awesome, thanks Sergio ! Much appreciated ;)

Regards
JB

On 01/31/2017 01:42 PM, Sergio Fernández wrote:
> PR #1879 provides the basics: https://github.com/apache/beam/pull/1879
>
> On Tue, Jan 31, 2017 at 1:33 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
> wrote:
>
>> No, that's fine as soon as we clearly document the prerequisite for the
>> build. IMHO, we should provide quick BUILDING instructions in the README.md.
>>
>> Regards
>> JB
>>
>>
>> On 01/31/2017 01:24 PM, Sergio Fernández wrote:
>>
>>> Originally we integrated the build in Maven with the default profile.
>>> Do you feel it would be better to have it under a separate profile or
>>> so?
>>>
>>> On Tue, Jan 31, 2017 at 11:07 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
>>> wrote:
>>>
>>> Just to be clear, the prerequisites to be able to build the Python SDK are:
>>>>
>>>> apt-get install python-setuptools
>>>> apt-get install python-pip
>>>>
>>>> It's also required by the default "regular" build.
>>>>
>>>> Regards
>>>> JB
>>>>
>>>>
>>>> On 01/31/2017 11:02 AM, Jean-Baptiste Onofré wrote:
>>>>
>>>> Just one thing I noticed (and can be helpful for others): to build Beam
>>>>> we now need python setuptools installed.
>>>>>
>>>>> For instance, on Ubuntu, you have to do:
>>>>>
>>>>> apt-get install python-setuptools
>>>>>
>>>>> Same for the pip distribution.
>>>>>
>>>>> I guess (if not already done), we have to update README/Building
>>>>> instructions.
>>>>>
>>>>> Correct ?
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On 01/31/2017 08:10 AM, Ahmet Altay wrote:
>>>>>
>>>>> Hi all,
>>>>>>
>>>>>> This merge is completed. Python SDK is now officially part of the
>>>>>> master
>>>>>> branch! Thank you all for the support. Please open an issue, if you
>>>>>> notice
>>>>>> a reference to the now obsolete python-sdk branch in the documentation.
>>>>>>
>>>>>> There will not be any more merges to the python-sdk branch. Going
>>>>>> forward
>>>>>> please use the master branch for Python SDK development. There are a
>>>>>> few
>>>>>> existing open PRs to the python-sdk [1]. If you are the author of one
>>>>>> of
>>>>>> those PRs, please rebase them on top of master.
>>>>>>
>>>>>> Thank you,
>>>>>> Ahmet
>>>>>>
>>>>>> [1] https://github.com/pulls?utf8=✓&q=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam+
>>>>>>
>>>>>>
>>>>>> On Fri, Jan 20, 2017 at 10:06 AM, Kenneth Knowles
>>>>>> <kl...@google.com.invalid>
>>>>>> wrote:
>>>>>>
>>>>>> To clarify the implied criteria of that last exchange, it is "An SDK
>>>>>>
>>>>>>> should
>>>>>>> have at least one runner that can execute the complete model (may be a
>>>>>>> direct runner)"
>>>>>>>
>>>>>>> I want to highlight this, because whether an _SDK_ supports unbounded
>>>>>>> data
>>>>>>> is not particularly well-defined, and will evolve:
>>>>>>>
>>>>>>>  - With the Runner API, an SDK will need to support building a graph
>>>>>>> with
>>>>>>> unbounded constructs, as today with probably minimal changes.
>>>>>>>
>>>>>>>  - With the Fn API, if any part of the Fn API is specific to unbounded
>>>>>>> data, the SDK will need to implement it. I think right now there is
>>>>>>> no such
>>>>>>> thing, and we don't want such a thing, so SDKs implementing the Fn API
>>>>>>> automatically support unbounded data.
>>>>>>>
>>>>>>>  - There will also likely be an SDK-specific shim just as there is
>>>>>>> today,
>>>>>>> to leverage idiomatic deserialized representations. The richness of
>>>>>>> this
>>>>>>> shim will decrease so that it will need to "support" unbounded data
>>>>>>> but
>>>>>>> that will be a ~one liner.
>>>>>>>
>>>>>>> Getting the Python SDK on master will accelerate our progress towards
>>>>>>> the
>>>>>>> Fn API - partly technical, partly community - which is the best path
>>>>>>> towards support for unbounded data across multiple runners. I think
>>>>>>> the
>>>>>>> criteria are written with the completed portability framework in
>>>>>>> mind. So
>>>>>>> this exchange makes me actually more convinced we should merge
>>>>>>> python-sdk
>>>>>>> to master.
>>>>>>>
>>>>>>> On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw <
>>>>>>> robertwb@google.com.invalid> wrote:
>>>>>>>
>>>>>>> On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin
>>>>>>>
>>>>>>>> <dh...@google.com.invalid> wrote:
>>>>>>>>
>>>>>>>> I do not think that Python SDK yet meets the bar [1] for implementing
>>>>>>>>>
>>>>>>>>> the
>>>>>>>>
>>>>>>>
>>>>>>> Beam model -- supporting Unbounded data is very important. That said,
>>>>>>>>
>>>>>>>>>
>>>>>>>>> given
>>>>>>>>
>>>>>>>> the committed and sustained set of contributors, it generally makes
>>>>>>>>>
>>>>>>>>> sense
>>>>>>>>
>>>>>>>
>>>>>>> to me to make an exception in anticipation of these features being
>>>>>>>>
>>>>>>>>>
>>>>>>>>> fleshed
>>>>>>>>
>>>>>>>> out soon; including potentially new users/contributors that would
>>>>>>>>>
>>>>>>>>> arrive
>>>>>>>>
>>>>>>>
>>>>>>> once in master.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Y
>>>>>>>>> k0PLmM3f5e5BqwJz4+C5doRucLNxOh8w@mail.gmail.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>> That is a valid point. The Python SDK supports all the unbounded
>>>>>>>> parts
>>>>>>>> of the model except for unbounded sources, which was deferred while
>>>>>>>> seeing how https://s.apache.org/splittable-do-fn played out. I've
>>>>>>>> been
>>>>>>>> working with the team and merging/reviewing most of their code, and
>>>>>>>> have full confidence this will be coming (and on that note can vouch
>>>>>>>> for a healthy community and support which are much harder to add
>>>>>>>> later).
>>>>>>>>
>>>>>>>> In short, I think it has the required maturity, and I'm in favor of
>>>>>>>> merging soonish.
>>>>>>>>
>>>>>>>> On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay
>>>>>>>>
>>>>>>>>> <altay@google.com.invalid
>>>>>>>>>
>>>>>>>>>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thank you all for the comments so far. I would follow the process as
>>>>>>>>>
>>>>>>>>>> suggested by Davor and others in this thread.
>>>>>>>>>>
>>>>>>>>>> Ahmet
>>>>>>>>>>
>>>>>>>>>> On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández <
>>>>>>>>>> wikier@apache.org
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Hi
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay
>>>>>>>>>>>
>>>>>>>>>>> <altay@google.com.invalid
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> tl;dr: I would like to start a discussion about merging
>>>>>>>>>>>> python-sdk
>>>>>>>>>>>>
>>>>>>>>>>>> branch
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> to master branch. Python SDK is mature enough and merging it to
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> master
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> will
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> accelerate its development and adoption.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Good point, Ahmet!
>>>>>>>>>>>
>>>>>>>>>>> I've been closely following the development since it was imported in
>>>>>>>>>>> June.
>>>>>>>>>>>
>>>>>>>>>>> For
>>>>>>>>>>
>>>>>>>>>
>>>>>>>> the prototypes I've implemented so far it works quite well; I guess
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> we'd
>>>>>>>>>>
>>>>>>>>>
>>>>>>>> just need to focus the next months in bringing more runners support.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> With a great effort from a lot of contributors(*), Python SDK [1]
>>>>>>>>>>> is
>>>>>>>>>>>
>>>>>>>>>>> now
>>>>>>>>>>
>>>>>>>>>
>>>>>>>> a
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> mostly complete, tested, performant Python implementation of the
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Beam
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> model. Since June, when we first started with Python SDK in Apache
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>> Beam
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> we
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> have been continuously improving it.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I wouldn't merge during the preparation of 0.5.0 release, but
>>>>>>>>>>> after
>>>>>>>>>>>
>>>>>>>>>>> that
>>>>>>>>>>
>>>>>>>>>
>>>>>>>> could be a good time to merge back into master.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ** Python SDK currently supports:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> * Model: All main concepts are present (ParDo, GroupByKey,
>>>>>>>>>>>>
>>>>>>>>>>>> Windowing
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>> etc.).
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> * IO: There are extensible APIs for writing new bounded sources
>>>>>>>>>>>>
>>>>>>>>>>>> and
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>> sinks.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> Implementations are provided for Text, Avro, BigQuery, and
>>>>>>>>>>>>
>>>>>>>>>>>> Datastore.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> * Runners: Python SDK has an extensible base runner module that
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>> allows
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> building specific runners on top of it. The SDK comes with two
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>> pipeline
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> runners: DirectRunner and DataflowRunner; and it is possible to
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>> add
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>> more.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> The existing runners are currently limited to bounded execution
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> and
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>> otherwise equivalent to their Java SDK counterparts in
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>> functionality.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>> What would the effort of porting, and maintaining, parallel
>>>>>>>>>>> versions
>>>>>>>>>>>
>>>>>>>>>>> of
>>>>>>>>>>
>>>>>>>>>
>>>>>>>> the
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Java runners? I guess I'd need to dig deeper in the model, but this
>>>>>>>>>>>
>>>>>>>>>>> may
>>>>>>>>>>
>>>>>>>>>
>>>>>>>> represent a major effort for the project, right?
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> It is somewhat higher for DirectRunner because DirectRunner also
>>>>>>>>>>
>>>>>>>>>> implements
>>>>>>>>>
>>>>>>>>
>>>>>>>> the code for execution. It is not that high for DataflowRunner
>>>>>>>>>
>>>>>>>>>> because
>>>>>>>>>>
>>>>>>>>>> the
>>>>>>>>>
>>>>>>>>
>>>>>>>> base runner module has a lot of helpers with the right hooks for
>>>>>>>>>
>>>>>>>>>> implementing a generic runner. I would _expect_ the experience in
>>>>>>>>>>
>>>>>>>>>> general
>>>>>>>>>
>>>>>>>>
>>>>>>>> would be similar to the latter.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> * Testing: Python SDK implements ValidatesRunner test framework
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> for
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>> implementing integration tests for current and future runners.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>> There
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>> is
>>>>>>>>
>>>>>>>> unit
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> test coverage for all modules, and a number of integration tests
>>>>>>>>>>>>
>>>>>>>>>>>> for
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>> validating existing runners.
>>>>>>>>
>>>>>>>>> * Documentation and examples: Documentation work has started on
>>>>>>>>>>>>
>>>>>>>>>>>> Python
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> SDK.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Beam Programming Guide page has been updated to include Python
>>>>>>>>>>>>
>>>>>>>>>>>> [2].
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>> The
>>>>>>>>
>>>>>>>> code comes with many ready to use examples and we are in a good
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>> place
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> to
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> start documenting those on the website.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> ** We are not done yet, next on the roadmap we have:
>>>>>>>>>>>>
>>>>>>>>>>>> * Streaming: Both of the existing runners lack support for
>>>>>>>>>>>>
>>>>>>>>>>>> streaming
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>> execution, and currently there is work going on for adding
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>> streaming
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>> support to DirectRunner [3].
>>>>>>>>
>>>>>>>>> * Documentation: Filling in the rest of the Beam documentation with
>>>>>>>>>>>>
>>>>>>>>>>>> Python
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> SDK specific information and examples.
>>>>>>>>>>>
>>>>>>>>>>>> * SDK consistency: Making Python SDK consistent with the Java
>>>>>>>>>>>> SDK.
>>>>>>>>>>>>
>>>>>>>>>>>> We
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> have
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> come a long way on this and have only a few items left [4].
>>>>>>>>>>>> * Beamifying: We have been working on removing Dataflow-specific
>>>>>>>>>>>>
>>>>>>>>>>>> references
>>>>>>>>>>>
>>>>>>>>>>> both from the documentation and from the code. There is some work
>>>>>>>>>>>>
>>>>>>>>>>>> left,
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> and
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> we are currently working on those as well [5].
>>>>>>>>>>>>
>>>>>>>>>>>> ** Steps and implications of merging to master:
>>>>>>>>>>>>
>>>>>>>>>>>> * Master branch is merged to python-sdk branch at regular
>>>>>>>>>>>>
>>>>>>>>>>>> intervals
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>> and
>>>>>>>>
>>>>>>>> the
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> last merge was on 12/22. All the past merges were uneventful
>>>>>>>>>>>>
>>>>>>>>>>>> because
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>> there
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> is a minimal overlap in modified files between branches.
>>>>>>>>>>>>
>>>>>>>>>>>> Integrating
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>> python-sdk to master will similarly touch a small number of
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>> existing
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>> files.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> * Python SDK is using the same tools for building and testing. It
>>>>>>>>>>>>
>>>>>>>>>>>> is
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>> already integrated with Maven, Jenkins and Travis. Specifically
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>> the
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>> impact
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> to the testing infrastructure would be:
>>>>>>>>>>>> - There will be two additional test configurations in Travis.
>>>>>>>>>>>>
>>>>>>>>>>>> Since
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>> Travis
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> runs all configurations in parallel there should not be a
>>>>>>>>>>>>
>>>>>>>>>>>> noticeable
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>> change
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> in the Travis run time.
>>>>>>>>>>>> - Jenkins pre-commit test will start running the Python SDK
>>>>>>>>>>>> tests.
>>>>>>>>>>>>
>>>>>>>>>>>> It
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> will
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> add an additional 5 minutes to the completion time of pre-commit
>>>>>>>>>>>>
>>>>>>>>>>>> test.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> Historically Python SDK tests were not flaky and did not cause any
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>> random
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> failures.
>>>>>>>>>>>
>>>>>>>>>>>> - Jenkins Python post-commit test is already separated from the
>>>>>>>>>>>>
>>>>>>>>>>>> other
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> post-commit tests and will continue to exist. It would not change
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>> the
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> testing time for any other test.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>> * The release process needs to be updated to accommodate
>>>>>>>>>>>> releasing
>>>>>>>>>>>>
>>>>>>>>>>>> Python
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> artifacts. Python SDK would fit in the existing release schedule
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> and
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>> could
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> be released along with the Java SDK. The additional steps would
>>>>>>>>>>>>
>>>>>>>>>>>> include:
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> - Generating Python artifacts. This could be done with a single
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> command
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> using Maven today.
>>>>>>>>>
>>>>>>>>>> - Publishing the artifacts to a central repository such as PyPI.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I'm more than happy to help on this. We left on purpose some
>>>>>>>>>>> things
>>>>>>>>>>>
>>>>>>>>>>> open
>>>>>>>>>>
>>>>>>>>>
>>>>>>>> when we added Maven support to the Python build.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> That would be awesome. We can coordinate on that post-merge.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> - Updating the release guide to reflect the changes above.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> * Users: There are existing users using the Python SDK. To give a
>>>>>>>>>>>>
>>>>>>>>>>>> rough
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> estimate, a distribution of the Beam Python SDK had a total of 23K
>>>>>>>>>
>>>>>>>>>> downloads in the past 6 months [6]. Some of those users are
>>>>>>>>>>>>
>>>>>>>>>>>> already
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>> engaged
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> with the community (e.g. [7]). There might be an increased amount
>>>>>>>>>>>> of engagement from the rest of them after the merge.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Python 3 support is something we definitely need to look ahead to.
>>>>>>>>>>>
>>>>>>>>>>> I'd
>>>>>>>>>>
>>>>>>>>>
>>>>>>> try
>>>>>>>>
>>>>>>>> to make the codebase compatible with both 2.7.x and 3.6.x, rather
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> than
>>>>>>>>>>
>>>>>>>>>
>>>>>>> using other solutions like 2to3.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I agree with you. I think it makes more sense to make codebase
>>>>>>>>>>
>>>>>>>>>> compatible
>>>>>>>>>
>>>>>>>>
>>>>>>>> with both. As you mentioned Python 3 support is not a short-term goal
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> in
>>>>>>>>>
>>>>>>>>
>>>>>>> the roadmap, and we can discuss it more as we approach that.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Looking forward to hearing your thoughts and comments on
>>>>>>>>>>>
>>>>>>>>>>> “graduating”
>>>>>>>>>>
>>>>>>>>>
>>>>>>> python-sdk to the master.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>> Thank you,
>>>>>>>>>>>> Ahmet
>>>>>>>>>>>>
>>>>>>>>>>>> (*) Python SDK branch currently has a diverse group of
>>>>>>>>>>>>
>>>>>>>>>>>> contributors.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>> Regular contributors include Charles Chen, Chamikara Jayalath,
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>> María
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>> García
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam
>>>>>>>>>>>>
>>>>>>>>>>>> PMC),
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>> Sourabh Bajaj, and Vikas Kedigehalli. We have also had
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>> contributions
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>> from
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Abdullah Bashir, Marco Buccini, Sergio Fernández, Seunghyun Lee,
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> and
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>> Younghee Kwon.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>> [1] https://github.com/apache/beam/tree/python-sdk/sdks/python
>>>>>>>>>>>> [2] https://beam.apache.org/documentation/programming-guide/
>>>>>>>>>>>> [3] https://issues.apache.org/jira/browse/BEAM-1265
>>>>>>>>>>>> [4] https://issues.apache.org/jira/issues/?jql=status%20%3D%20Open%20AND%20labels%20%3D%20sdk-consistency
>>>>>>>>>>>> [5] https://issues.apache.org/jira/browse/BEAM-1218
>>>>>>>>>>>> [6] https://pypi.python.org/pypi/google-cloud-dataflow/json
>>>>>>>>>>>> [7] https://issues.apache.org/jira/browse/BEAM-1251
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> Great summary, Ahmet. Thanks.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Sergio Fernández
>>>>>>>>>>> Partner Technology Manager
>>>>>>>>>>> Redlink GmbH
>>>>>>>>>>> m: +43 6602747925
>>>>>>>>>>> e: sergio.fernandez@redlink.co
>>>>>>>>>>> w: http://redlink.co
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>> --
>>>> Jean-Baptiste Onofré
>>>> jbonofre@apache.org
>>>> http://blog.nanthrax.net
>>>> Talend - http://www.talend.com
>>>>
>>>>
>>>
>>>
>>>
>> --
>> Jean-Baptiste Onofré
>> jbonofre@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
>
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: [DISCUSS] Python SDK status and next steps

Posted by Sergio Fernández <wi...@apache.org>.
PR #1879 provides the basics: https://github.com/apache/beam/pull/1879

On Tue, Jan 31, 2017 at 1:33 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> No, that's fine as long as we clearly document the prerequisites for the
> build. IMHO, we should provide quick BUILDING instructions in the README.md.
>
> Regards
> JB
>
>
> On 01/31/2017 01:24 PM, Sergio Fernández wrote:
>
>> Originally we integrated the build in Maven with the default profile.
>> Do you feel like it'd be better to have it under a separate profile or
>> so?
>>
>> On Tue, Jan 31, 2017 at 11:07 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
>> wrote:
>>
>> Just to be clear, the prerequisites to be able to build the Python SDK are:
>>>
>>> apt-get install python-setuptools
>>> apt-get install python-pip
>>>
>>> It's also required by the default "regular" build.
>>>
>>> Regards
>>> JB
>>>
>>>
>>> On 01/31/2017 11:02 AM, Jean-Baptiste Onofré wrote:
>>>
>>> Just one thing I noticed (and can be helpful for others): to build Beam
>>>> we now need python setuptools installed.
>>>>
>>>> For instance, on Ubuntu, you have to do:
>>>>
>>>> apt-get install python-setuptools
>>>>
>>>> Same for the pip distribution.
>>>>
>>>> I guess (if not already done), we have to update README/Building
>>>> instructions.
>>>>
>>>> Correct?
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On 01/31/2017 08:10 AM, Ahmet Altay wrote:
>>>>
>>>> Hi all,
>>>>>
>>>>> This merge is completed. Python SDK is now officially part of the
>>>>> master
>>>>> branch! Thank you all for the support. Please open an issue, if you
>>>>> notice
>>>>> a reference to the now obsolete python-sdk branch in the documentation.
>>>>>
>>>>> There will not be any more merges to the python-sdk branch. Going
>>>>> forward
>>>>> please use the master branch for Python SDK development. There are a
>>>>> few
>>>>> existing open PRs to the python-sdk [1]. If you are the author of one
>>>>> of
>>>>> those PRs, please rebase them on top of master.
>>>>>
>>>>> Thank you,
>>>>> Ahmet
>>>>>
>>>>> [1] https://github.com/pulls?utf8=%E2%9C%93&q=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam+
>>>>>
>>>>>
>>>>> On Fri, Jan 20, 2017 at 10:06 AM, Kenneth Knowles
>>>>> <kl...@google.com.invalid>
>>>>> wrote:
>>>>>
>>>>> To clarify the implied criteria of that last exchange, it is "An SDK
>>>>>
>>>>>> should
>>>>>> have at least one runner that can execute the complete model (may be a
>>>>>> direct runner)"
>>>>>>
>>>>>> I want to highlight this, because whether an _SDK_ supports unbounded
>>>>>> data
>>>>>> is not particularly well-defined, and will evolve:
>>>>>>
>>>>>>  - With the Runner API, an SDK will need to support building a graph
>>>>>> with
>>>>>> unbounded constructs, as today with probably minimal changes.
>>>>>>
>>>>>>  - With the Fn API, if any part of the Fn API is specific to unbounded
>>>>>> data, the SDK will need to implement it. I think right now there is
>>>>>> no such
>>>>>> thing, and we don't want such a thing, so SDKs implementing the Fn API
>>>>>> automatically support unbounded data.
>>>>>>
>>>>>>  - There will also likely be an SDK-specific shim just as there is
>>>>>> today,
>>>>>> to leverage idiomatic deserialized representations. The richness of
>>>>>> this
>>>>>> shim will decrease so that it will need to "support" unbounded data
>>>>>> but
>>>>>> that will be a ~one liner.
>>>>>>
>>>>>> Getting the Python SDK on master will accelerate our progress towards
>>>>>> the
>>>>>> Fn API - partly technical, partly community - which is the best path
>>>>>> towards support for unbounded data across multiple runners. I think
>>>>>> the
>>>>>> criteria are written with the completed portability framework in
>>>>>> mind. So
>>>>>> this exchange makes me actually more convinced we should merge
>>>>>> python-sdk
>>>>>> to master.
>>>>>>
>>>>>> On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw <
>>>>>> robertwb@google.com.invalid> wrote:
>>>>>>
>>>>>> On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin
>>>>>>
>>>>>>> <dh...@google.com.invalid> wrote:
>>>>>>>
>>>>>>> I do not think that Python SDK yet meets the bar [1] for implementing
>>>>>>>>
>>>>>>>> the
>>>>>>>
>>>>>>
>>>>>> Beam model -- supporting Unbounded data is very important. That said,
>>>>>>>
>>>>>>>>
>>>>>>>> given
>>>>>>>
>>>>>>> the committed and sustained set of contributors, it generally makes
>>>>>>>>
>>>>>>>> sense
>>>>>>>
>>>>>>
>>>>>> to me to make an exception in anticipation of these features being
>>>>>>>
>>>>>>>>
>>>>>>>> fleshed
>>>>>>>
>>>>>>> out soon; including potentially new users/contributors that would
>>>>>>>>
>>>>>>>> arrive
>>>>>>>
>>>>>>
>>>>>> once in master.
>>>>>>>
>>>>>>>>
>>>>>>>> [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Yk0PLmM3f5e5BqwJz4+C5doRucLNxOh8w@mail.gmail.com
>>>>>>>>
>>>>>>>>
>>>>>>> That is a valid point. The Python SDK supports all the unbounded
>>>>>>> parts
>>>>>>> of the model except for unbounded sources, which was deferred while
>>>>>>> seeing how https://s.apache.org/splittable-do-fn played out. I've
>>>>>>> been
>>>>>>> working with the team and merging/reviewing most of their code, and
>>>>>>> have full confidence this will be coming (and on that note can vouch
>>>>>>> for a healthy community and support which are much harder to add
>>>>>>> later).
>>>>>>>
>>>>>>> In short, I think it has the required maturity, and I'm in favor of
>>>>>>> merging soonish.
>>>>>>>
>>>>>>> On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay
>>>>>>>
>>>>>>>> <altay@google.com.invalid
>>>>>>>>
>>>>>>>>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> Thank you all for the comments so far. I would follow the process as
>>>>>>>>
>>>>>>>>> suggested by Davor and others in this thread.
>>>>>>>>>
>>>>>>>>> Ahmet
>>>>>>>>>
>>>>>>>>> On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández <
>>>>>>>>> wikier@apache.org
>>>>>>>>>
>>>>>>>>>
>>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>> Hi
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay
>>>>>>>>>>
>>>>>>>>>> <altay@google.com.invalid
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> tl;dr: I would like to start a discussion about merging
>>>>>>>>>>> python-sdk
>>>>>>>>>>>
>>>>>>>>>>> branch
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> to master branch. Python SDK is mature enough and merging it to
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> master
>>>>>>>>>>
>>>>>>>>>
>>>>>>> will
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> accelerate its development and adoption.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Good point, Ahmet!
>>>>>>>>>>
>>>>>>>>>> I've been closely following the development since it was imported in
>>>>>>>>>> June.
>>>>>>>>>>
>>>>>>>>>> For
>>>>>>>>>
>>>>>>>>
>>>>>>> the prototypes I've implemented so far it works quite well; I guess
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> we'd
>>>>>>>>>
>>>>>>>>
>>>>>>> just need to focus the next months on bringing support for more runners.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> With a great effort from a lot of contributors(*), Python SDK [1]
>>>>>>>>>> is
>>>>>>>>>>
>>>>>>>>>> now
>>>>>>>>>
>>>>>>>>
>>>>>>> a
>>>>>>>>
>>>>>>>>>
>>>>>>>>> mostly complete, tested, performant Python implementation of the
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Beam
>>>>>>>>>>
>>>>>>>>>
>>>>>>> model. Since June, when we first started with Python SDK in Apache
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> Beam
>>>>>>>>>>
>>>>>>>>>
>>>>>>> we
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> have been continuously improving it.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I wouldn't merge during the preparation of 0.5.0 release, but
>>>>>>>>>> after
>>>>>>>>>>
>>>>>>>>>> that
>>>>>>>>>
>>>>>>>>
>>>>>>> could be a good time to merge back into master.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ** Python SDK currently supports:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> * Model: All main concepts are present (ParDo, GroupByKey,
>>>>>>>>>>>
>>>>>>>>>>> Windowing
>>>>>>>>>>
>>>>>>>>>
>>>>>> etc.).
>>>>>>>
>>>>>>>>
>>>>>>>>>> * IO: There are extensible APIs for writing new bounded sources
>>>>>>>>>>>
>>>>>>>>>>> and
>>>>>>>>>>
>>>>>>>>>
>>>>>> sinks.
>>>>>>>
>>>>>>>>
>>>>>>>>>> Implementations are provided for Text, Avro, BigQuery, and
>>>>>>>>>>>
>>>>>>>>>>> Datastore.
>>>>>>>>>>
>>>>>>>>>
>>>>>>> * Runners: Python SDK has an extensible base runner module that
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> allows
>>>>>>>>>>
>>>>>>>>>
>>>>>>> building specific runners on top of it. The SDK comes with two
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> pipeline
>>>>>>>>>>
>>>>>>>>>
>>>>>>> runners: DirectRunner and DataflowRunner; and it is possible to
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> add
>>>>>>>>>>
>>>>>>>>>
>>>>>> more.
>>>>>>>
>>>>>>>>
>>>>>>>>> The existing runners are currently limited to bounded execution
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> and
>>>>>>>>>>
>>>>>>>>>
>>>>>> otherwise equivalent to their Java SDK counterparts in
>>>>>>>
>>>>>>>>
>>>>>>>>>>> functionality.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>>> What would be the effort of porting, and maintaining, parallel
>>>>>>>>>> versions
>>>>>>>>>>
>>>>>>>>>> of
>>>>>>>>>
>>>>>>>>
>>>>>>> the
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Java runners? I guess I'd need to dig deeper in the model, but this
>>>>>>>>>>
>>>>>>>>>> may
>>>>>>>>>
>>>>>>>>
>>>>>>> represent a major effort for the project, right?
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> It is somewhat higher for DirectRunner because DirectRunner also
>>>>>>>>>
>>>>>>>>> implements
>>>>>>>>
>>>>>>>
>>>>>>> the code for execution. It is not that high for DataflowRunner
>>>>>>>>
>>>>>>>>> because
>>>>>>>>>
>>>>>>>>> the
>>>>>>>>
>>>>>>>
>>>>>>> base runner module has a lot of helpers with the right hooks for
>>>>>>>>
>>>>>>>>> implementing a generic runner. I would _expect_ the experience in
>>>>>>>>>
>>>>>>>>> general
>>>>>>>>
>>>>>>>
>>>>>>> would be similar to the latter.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> * Testing: Python SDK implements ValidatesRunner test framework
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> for
>>>>>>>>>>
>>>>>>>>>
>>>>>> implementing integration tests for current and future runners.
>>>>>>>
>>>>>>>>
>>>>>>>>>>> There
>>>>>>>>>>
>>>>>>>>>
>>>>>> is
>>>>>>>
>>>>>>> unit
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> test coverage for all modules, and a number of integration tests
>>>>>>>>>>>
>>>>>>>>>>> for
>>>>>>>>>>
>>>>>>>>>
>>>>>> validating existing runners.
>>>>>>>
>>>>>>>> * Documentation and examples: Documentation work has started on
>>>>>>>>>>>
>>>>>>>>>>> Python
>>>>>>>>>>
>>>>>>>>>
>>>>>>> SDK.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Beam Programming Guide page has been updated to include Python
>>>>>>>>>>>
>>>>>>>>>>> [2].
>>>>>>>>>>
>>>>>>>>>
>>>>>> The
>>>>>>>
>>>>>>> code comes with many ready to use examples and we are in a good
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> place
>>>>>>>>>>
>>>>>>>>>
>>>>>>> to
>>>>>>>>
>>>>>>>>>
>>>>>>>>> start documenting those on the website.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ** We are not done yet, next on the roadmap we have:
>>>>>>>>>>>
>>>>>>>>>>> * Streaming: Both of the existing runners lack support for
>>>>>>>>>>>
>>>>>>>>>>> streaming
>>>>>>>>>>
>>>>>>>>>
>>>>>> execution, and currently there is work going on for adding
>>>>>>>
>>>>>>>>
>>>>>>>>>>> streaming
>>>>>>>>>>
>>>>>>>>>
>>>>>> support to DirectRunner [3].
>>>>>>>
>>>>>>>> * Documentation: Filling in the rest of the Beam documentation with
>>>>>>>>>>>
>>>>>>>>>>> Python
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> SDK specific information and examples.
>>>>>>>>>>
>>>>>>>>>>> * SDK consistency: Making Python SDK consistent with the Java
>>>>>>>>>>> SDK.
>>>>>>>>>>>
>>>>>>>>>>> We
>>>>>>>>>>
>>>>>>>>>
>>>>>>> have
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> come a long way on this and have only a few items left [4].
>>>>>>>>>>> * Beamifying: We have been working on removing Dataflow-specific
>>>>>>>>>>>
>>>>>>>>>>> references
>>>>>>>>>>
>>>>>>>>>> both from the documentation and from the code. There is some work
>>>>>>>>>>>
>>>>>>>>>>> left,
>>>>>>>>>>
>>>>>>>>>
>>>>>>> and
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> we are currently working on those as well [5].
>>>>>>>>>>>
>>>>>>>>>>> ** Steps and implications of merging to master:
>>>>>>>>>>>
>>>>>>>>>>> * Master branch is merged to python-sdk branch at regular
>>>>>>>>>>>
>>>>>>>>>>> intervals
>>>>>>>>>>
>>>>>>>>>
>>>>>> and
>>>>>>>
>>>>>>> the
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> last merge was on 12/22. All the past merges were uneventful
>>>>>>>>>>>
>>>>>>>>>>> because
>>>>>>>>>>
>>>>>>>>>
>>>>>> there
>>>>>>>
>>>>>>>>
>>>>>>>>>> is a minimal overlap in modified files between branches.
>>>>>>>>>>>
>>>>>>>>>>> Integrating
>>>>>>>>>>
>>>>>>>>>
>>>>>> python-sdk to master will similarly touch a small number of
>>>>>>>
>>>>>>>>
>>>>>>>>>>> existing
>>>>>>>>>>
>>>>>>>>>
>>>>>> files.
>>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> * Python SDK is using the same tools for building and testing. It
>>>>>>>>>>>
>>>>>>>>>>> is
>>>>>>>>>>
>>>>>>>>>
>>>>>> already integrated with Maven, Jenkins and Travis. Specifically
>>>>>>>
>>>>>>>>
>>>>>>>>>>> the
>>>>>>>>>>
>>>>>>>>>
>>>>>> impact
>>>>>>>
>>>>>>>>
>>>>>>>>>> to the testing infrastructure would be:
>>>>>>>>>>> - There will be two additional test configurations in Travis.
>>>>>>>>>>>
>>>>>>>>>>> Since
>>>>>>>>>>
>>>>>>>>>
>>>>>> Travis
>>>>>>>
>>>>>>>>
>>>>>>>>>> runs all configurations in parallel there should not be a
>>>>>>>>>>>
>>>>>>>>>>> noticeable
>>>>>>>>>>
>>>>>>>>>
>>>>>> change
>>>>>>>
>>>>>>>>
>>>>>>>>>> in the Travis run time.
>>>>>>>>>>> - Jenkins pre-commit test will start running the Python SDK
>>>>>>>>>>> tests.
>>>>>>>>>>>
>>>>>>>>>>> It
>>>>>>>>>>
>>>>>>>>>
>>>>>>> will
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> add an additional 5 minutes to the completion time of pre-commit
>>>>>>>>>>>
>>>>>>>>>>> test.
>>>>>>>>>>
>>>>>>>>>
>>>>>>> Historically Python SDK tests were not flaky and did not cause any
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> random
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> failures.
>>>>>>>>>>
>>>>>>>>>>> - Jenkins Python post-commit test is already separated from the
>>>>>>>>>>>
>>>>>>>>>>> other
>>>>>>>>>>
>>>>>>>>>
>>>>>>> post-commit tests and will continue to exist. It would not change
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> the
>>>>>>>>>>
>>>>>>>>>
>>>>>>> testing time for any other test.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> * The release process needs to be updated to accommodate
>>>>>>>>>>> releasing
>>>>>>>>>>>
>>>>>>>>>>> Python
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> artifacts. Python SDK would fit in the existing release schedule
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> and
>>>>>>>>>>
>>>>>>>>>
>>>>>> could
>>>>>>>
>>>>>>>>
>>>>>>>>>> be released along with the Java SDK. The additional steps would
>>>>>>>>>>>
>>>>>>>>>>> include:
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> - Generating Python artifacts. This could be done with a single
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> command
>>>>>>>>>>
>>>>>>>>>
>>>>>>> using Maven today.
>>>>>>>>
>>>>>>>>> - Publishing the artifacts to a central repository such as PyPI.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I'm more than happy to help on this. We left on purpose some
>>>>>>>>>> things
>>>>>>>>>>
>>>>>>>>>> open
>>>>>>>>>
>>>>>>>>
>>>>>>> when we added Maven support to the Python build.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> That would be awesome. We can coordinate on that post-merge.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> - Updating the release guide to reflect the changes above.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> * Users: There are existing users using the Python SDK. To give a
>>>>>>>>>>>
>>>>>>>>>>> rough
>>>>>>>>>>
>>>>>>>>>
>>>>>>> estimate, a distribution of the Beam Python SDK had a total of 23K
>>>>>>>>
>>>>>>>>> downloads in the past 6 months [6]. Some of those users are
>>>>>>>>>>>
>>>>>>>>>>> already
>>>>>>>>>>
>>>>>>>>>
>>>>>> engaged
>>>>>>>
>>>>>>>>
>>>>>>>>>> with the community (e.g. [7]). There might be an increased amount
>>>>>>>>>>> of engagement from the rest of them after the merge.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Python 3 support is something we definitely need to look ahead to.
>>>>>>>>>>
>>>>>>>>>> I'd
>>>>>>>>>
>>>>>>>>
>>>>>> try
>>>>>>>
>>>>>>> to make the codebase compatible with both 2.7.x and 3.6.x, rather
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> than
>>>>>>>>>
>>>>>>>>
>>>>>> using other solutions like 2to3.
>>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I agree with you. I think it makes more sense to make codebase
>>>>>>>>>
>>>>>>>>> compatible
>>>>>>>>
>>>>>>>
>>>>>>> with both. As you mentioned Python 3 support is not a short-term goal
>>>>>>>>
>>>>>>>>>
>>>>>>>>> in
>>>>>>>>
>>>>>>>
>>>>>> the roadmap, and we can discuss it more as we approach that.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Looking forward to hearing your thoughts and comments on
>>>>>>>>>>
>>>>>>>>>> “graduating”
>>>>>>>>>
>>>>>>>>
>>>>>> python-sdk to the master.
>>>>>>>
>>>>>>>>
>>>>>>>>>>> Thank you,
>>>>>>>>>>> Ahmet
>>>>>>>>>>>
>>>>>>>>>>> (*) Python SDK branch currently has a diverse group of
>>>>>>>>>>>
>>>>>>>>>>> contributors.
>>>>>>>>>>
>>>>>>>>>
>>>>>> Regular contributors include Charles Chen, Chamikara Jayalath,
>>>>>>>
>>>>>>>>
>>>>>>>>>>> María
>>>>>>>>>>
>>>>>>>>>
>>>>>> García
>>>>>>>
>>>>>>>>
>>>>>>>>>> Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam
>>>>>>>>>>>
>>>>>>>>>>> PMC),
>>>>>>>>>>
>>>>>>>>>
>>>>>> Sourabh Bajaj, and Vikas Kedigehalli. We have also had
>>>>>>>
>>>>>>>>
>>>>>>>>>>> contributions
>>>>>>>>>>
>>>>>>>>>
>>>>>> from
>>>>>>>
>>>>>>>>
>>>>>>>>> Abdullah Bashir, Marco Buccini, Sergio Fernández, Seunghyun Lee,
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> and
>>>>>>>>>>
>>>>>>>>>
>>>>>> Younghee Kwon.
>>>>>>>
>>>>>>>>
>>>>>>>>>>> [1] https://github.com/apache/beam/tree/python-sdk/sdks/python
>>>>>>>>>>> [2] https://beam.apache.org/documentation/programming-guide/
>>>>>>>>>>> [3] https://issues.apache.org/jira/browse/BEAM-1265
>>>>>>>>>>> [4] https://issues.apache.org/jira/issues/?jql=status%20%3D%20Open%20AND%20labels%20%3D%20sdk-consistency
>>>>>>>>>>> [5] https://issues.apache.org/jira/browse/BEAM-1218
>>>>>>>>>>> [6] https://pypi.python.org/pypi/google-cloud-dataflow/json
>>>>>>>>>>> [7] https://issues.apache.org/jira/browse/BEAM-1251
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> Great summary, Ahmet. Thanks.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Sergio Fernández
>>>>>>>>>> Partner Technology Manager
>>>>>>>>>> Redlink GmbH
>>>>>>>>>> m: +43 6602747925
>>>>>>>>>> e: sergio.fernandez@redlink.co
>>>>>>>>>> w: http://redlink.co
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>> --
>>> Jean-Baptiste Onofré
>>> jbonofre@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>>
>>
>>
>>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>



-- 
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 6602747925
e: sergio.fernandez@redlink.co
w: http://redlink.co

Re: [DISCUSS] Python SDK status and next steps

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
No, that's fine as long as we clearly document the prerequisites for the
build. IMHO, we should provide quick BUILDING instructions in the README.md.

Regards
JB
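
As a rough illustration of the prerequisite discussed in this thread
(setuptools and pip must be available before the build), a small check
could look like the sketch below. It is not part of Beam or of the Maven
build; the function name missing_prerequisites is hypothetical, and the
sketch assumes a Python 3 interpreter, whereas the SDK at the time of
this thread targeted Python 2.7.

```python
import importlib.util

# Modules this thread names as build prerequisites
# (installed on Ubuntu via python-setuptools and python-pip).
REQUIRED_MODULES = ("setuptools", "pip")


def missing_prerequisites(modules=REQUIRED_MODULES):
    """Return the names in `modules` that the import system cannot locate."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_prerequisites()
    if missing:
        print("Missing build prerequisites: " + ", ".join(missing))
    else:
        print("All build prerequisites found.")
```

Running a check like this before the build would turn a confusing
mid-build failure into a short, explicit message about what to install.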

On 01/31/2017 01:24 PM, Sergio Fernández wrote:
> Originally we integrated the build in Maven with the default profile.
> Do you feel like it'd be better to have it under a separate profile or so?
>
> On Tue, Jan 31, 2017 at 11:07 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
> wrote:
>
>> Just to be clear, the prerequisites to be able to build the Python SDK are:
>>
>> apt-get install python-setuptools
>> apt-get install python-pip
>>
>> It's also required by the default "regular" build.
>>
>> Regards
>> JB
>>
>>
>> On 01/31/2017 11:02 AM, Jean-Baptiste Onofré wrote:
>>
>>> Just one thing I noticed (and can be helpful for others): to build Beam
>>> we now need python setuptools installed.
>>>
>>> For instance, on Ubuntu, you have to do:
>>>
>>> apt-get install python-setuptools
>>>
>>> Same for the pip distribution.
>>>
>>> I guess (if not already done), we have to update README/Building
>>> instructions.
>>>
>>> Correct?
>>>
>>> Regards
>>> JB
>>>
>>> On 01/31/2017 08:10 AM, Ahmet Altay wrote:
>>>
>>>> Hi all,
>>>>
>>>> This merge is completed. Python SDK is now officially part of the master
>>>> branch! Thank you all for the support. Please open an issue, if you
>>>> notice
>>>> a reference to the now obsolete python-sdk branch in the documentation.
>>>>
>>>> There will not be any more merges to the python-sdk branch. Going forward
>>>> please use the master branch for Python SDK development. There are a few
>>>> existing open PRs to the python-sdk [1]. If you are the author of one of
>>>> those PRs, please rebase them on top of master.
>>>>
>>>> Thank you,
>>>> Ahmet
>>>>
>>>> [1] https://github.com/pulls?utf8=%E2%9C%93&q=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam+
>>>>
>>>>
>>>> On Fri, Jan 20, 2017 at 10:06 AM, Kenneth Knowles
>>>> <kl...@google.com.invalid>
>>>> wrote:
>>>>
>>>> To clarify the implied criteria of that last exchange, it is "An SDK
>>>>> should
>>>>> have at least one runner that can execute the complete model (may be a
>>>>> direct runner)"
>>>>>
>>>>> I want to highlight this, because whether an _SDK_ supports unbounded
>>>>> data
>>>>> is not particularly well-defined, and will evolve:
>>>>>
>>>>>  - With the Runner API, an SDK will need to support building a graph
>>>>> with
>>>>> unbounded constructs, as today with probably minimal changes.
>>>>>
>>>>>  - With the Fn API, if any part of the Fn API is specific to unbounded
>>>>> data, the SDK will need to implement it. I think right now there is
>>>>> no such
>>>>> thing, and we don't want such a thing, so SDKs implementing the Fn API
>>>>> automatically support unbounded data.
>>>>>
>>>>>  - There will also likely be an SDK-specific shim just as there is
>>>>> today,
>>>>> to leverage idiomatic deserialized representations. The richness of this
>>>>> shim will decrease so that it will need to "support" unbounded data but
>>>>> that will be a ~one liner.
>>>>>
>>>>> Getting the Python SDK on master will accelerate our progress towards
>>>>> the
>>>>> Fn API - partly technical, partly community - which is the best path
>>>>> towards support for unbounded data across multiple runners. I think the
>>>>> criteria are written with the completed portability framework in
>>>>> mind. So
>>>>> this exchange makes me actually more convinced we should merge
>>>>> python-sdk
>>>>> to master.
>>>>>
>>>>> On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw <
>>>>> robertwb@google.com.invalid> wrote:
>>>>>
>>>>> On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin
>>>>>> <dh...@google.com.invalid> wrote:
>>>>>>
>>>>>>> I do not think that Python SDK yet meets the bar [1] for implementing
>>>>>>>
>>>>>> the
>>>>>
>>>>>> Beam model -- supporting Unbounded data is very important. That said,
>>>>>>>
>>>>>> given
>>>>>>
>>>>>>> the committed and sustained set of contributors, it generally makes
>>>>>>>
>>>>>> sense
>>>>>
>>>>>> to me to make an exception in anticipation of these features being
>>>>>>>
>>>>>> fleshed
>>>>>>
>>>>>>> out soon; including potentially new users/contributors that would
>>>>>>>
>>>>>> arrive
>>>>>
>>>>>> once in master.
>>>>>>>
>>>>>>> [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Y
>>>>>>> k0PLmM3f5e5BqwJz4+C5doRucLNxOh8w@mail.gmail.com
>>>>>>>
>>>>>>
>>>>>> That is a valid point. The Python SDK supports all the unbounded parts
>>>>>> of the model except for unbounded sources, which was deferred while
>>>>>> seeing how https://s.apache.org/splittable-do-fn played out. I've been
>>>>>> working with the team and merging/reviewing most of their code, and
>>>>>> have full confidence this will be coming (and on that note can vouch
>>>>>> for a healthy community and support which are much harder to add
>>>>>> later).
>>>>>>
>>>>>> In short, I think it has the required maturity, and I'm in favor of
>>>>>> merging soonish.
>>>>>>
>>>>>> On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay
>>>>>>> <altay@google.com.invalid
>>>>>>>
>>>>>>
>>>>>> wrote:
>>>>>>>
>>>>>>> Thank you all for the comments so far. I would follow the process as
>>>>>>>> suggested by Davor and others in this thread.
>>>>>>>>
>>>>>>>> Ahmet
>>>>>>>>
>>>>>>>> On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández <
>>>>>>>> wikier@apache.org
>>>>>>>>
>>>>>>>
>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi
>>>>>>>>>
>>>>>>>>> On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay
>>>>>>>>>
>>>>>>>> <altay@google.com.invalid
>>>>>
>>>>>>
>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> tl;dr: I would like to start a discussion about merging python-sdk
>>>>>>>>>>
>>>>>>>>> branch
>>>>>>>>
>>>>>>>>> to master branch. Python SDK is mature enough and merging it to
>>>>>>>>>>
>>>>>>>>> master
>>>>>>
>>>>>>> will
>>>>>>>>>
>>>>>>>>>> accelerate its development and adoption.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> Good point, Ahmet!
>>>>>>>>>
>>>>>>>>> I've been following the development closely since it was imported in June.
>>>>>>>>>
>>>>>>>> For
>>>>>>
>>>>>>> the prototypes I've implemented so far it works quite well; I guess
>>>>>>>>>
>>>>>>>> we'd
>>>>>>
>>>>>>> just need to focus the next months in bringing more runners support.
>>>>>>>>>
>>>>>>>>> With a great effort from a lot of contributors(*), Python SDK [1] is
>>>>>>>>>
>>>>>>>> now
>>>>>>
>>>>>>> a
>>>>>>>>
>>>>>>>>> mostly complete, tested, performant Python implementation of the
>>>>>>>>>>
>>>>>>>>> Beam
>>>>>>
>>>>>>> model. Since June, when we first started with Python SDK in Apache
>>>>>>>>>>
>>>>>>>>> Beam
>>>>>>
>>>>>>> we
>>>>>>>>>
>>>>>>>>>> have been continuously improving it.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> I wouldn't merge during the preparation of 0.5.0 release, but after
>>>>>>>>>
>>>>>>>> that
>>>>>>
>>>>>>> could be a good time to merge back into master.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ** Python SDK currently supports:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> * Model: All main concepts are present (ParDo, GroupByKey,
>>>>>>>>>>
>>>>>>>>> Windowing
>>>>>
>>>>>> etc.).
>>>>>>>>>
>>>>>>>>>> * IO: There are extensible APIs for writing new bounded sources
>>>>>>>>>>
>>>>>>>>> and
>>>>>
>>>>>> sinks.
>>>>>>>>>
>>>>>>>>>> Implementations are provided for Text, Avro, BigQuery, and
>>>>>>>>>>
>>>>>>>>> Datastore.
>>>>>>
>>>>>>> * Runners: Python SDK has an extensible base runner module that
>>>>>>>>>>
>>>>>>>>> allows
>>>>>>
>>>>>>> building specific runners on top of it. The SDK comes with two
>>>>>>>>>>
>>>>>>>>> pipeline
>>>>>>
>>>>>>> runners: DirectRunner and DataflowRunner; and it is possible to
>>>>>>>>>>
>>>>>>>>> add
>>>>>
>>>>>> more.
>>>>>>>>
>>>>>>>>> The existing runners are currently limited to bounded execution
>>>>>>>>>>
>>>>>>>>> and
>>>>>
>>>>>> otherwise equivalent to their Java SDK counterparts in
>>>>>>>>>>
>>>>>>>>> functionality.
>>>>>>
>>>>>>>
>>>>>>>>>>
>>>>>>>>> What would the effort of porting, and maintaining, parallel versions
>>>>>>>>>
>>>>>>>> of
>>>>>>
>>>>>>> the
>>>>>>>>
>>>>>>>>> Java runners? I guess I'd need to dig deeper in the model, but this
>>>>>>>>>
>>>>>>>> may
>>>>>>
>>>>>>> represent a major effort for the project, right?
>>>>>>>>>
>>>>>>>>>
>>>>>>>> It is somewhat higher for DirectRunner because DirectRunner also
>>>>>>>>
>>>>>>> implements
>>>>>>
>>>>>>> the code for execution. It is not that high for DataflowRunner
>>>>>>>> because
>>>>>>>>
>>>>>>> the
>>>>>>
>>>>>>> base runner module has a lot of helpers with the right hooks for
>>>>>>>> implementing a generic runner. I would _expect_ the experience in
>>>>>>>>
>>>>>>> general
>>>>>>
>>>>>>> would be similar to the latter.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> * Testing: Python SDK implements ValidatesRunner test framework
>>>>>>>>>>
>>>>>>>>> for
>>>>>
>>>>>> implementing integration test for current and future runners.
>>>>>>>>>>
>>>>>>>>> There
>>>>>
>>>>>> is
>>>>>>
>>>>>>> unit
>>>>>>>>>
>>>>>>>>>> test coverage for all modules, and a number of integrations test
>>>>>>>>>>
>>>>>>>>> for
>>>>>
>>>>>> validating existing runners.
>>>>>>>>>> * Documentation and examples: Documentation work has started on
>>>>>>>>>>
>>>>>>>>> Python
>>>>>>
>>>>>>> SDK.
>>>>>>>>>
>>>>>>>>>> Beam Programming Guide page has been updated to include Python
>>>>>>>>>>
>>>>>>>>> [2].
>>>>>
>>>>>> The
>>>>>>
>>>>>>> code comes with many ready to use examples and we are in a good
>>>>>>>>>>
>>>>>>>>> place
>>>>>>
>>>>>>> to
>>>>>>>>
>>>>>>>>> start documenting those on the website.
>>>>>>>>>>
>>>>>>>>>> ** We are not done yet, next on the roadmap we have:
>>>>>>>>>>
>>>>>>>>>> * Streaming: Both of the existing runners lack support for
>>>>>>>>>>
>>>>>>>>> streaming
>>>>>
>>>>>> execution, and currently there is work going on for adding
>>>>>>>>>>
>>>>>>>>> streaming
>>>>>
>>>>>> support to DirectRunner [3].
>>>>>>>>>> * Documentation: Filling the rest of the Beam documentations with
>>>>>>>>>>
>>>>>>>>> Python
>>>>>>>>
>>>>>>>>> SDK specific information and examples.
>>>>>>>>>> * SDK consistency: Making Python SDK consistent with the Java SDK.
>>>>>>>>>>
>>>>>>>>> We
>>>>>>
>>>>>>> have
>>>>>>>>>
>>>>>>>>>> come a long way on this and have only a few items left [4].
>>>>>>>>>> * Beamifying: We have been working on removing Dataflow-specific
>>>>>>>>>>
>>>>>>>>> references
>>>>>>>>>
>>>>>>>>>> both from the documentation and from the code. There is some work
>>>>>>>>>>
>>>>>>>>> left,
>>>>>>
>>>>>>> and
>>>>>>>>>
>>>>>>>>>> we are currently working on those as well [5].
>>>>>>>>>>
>>>>>>>>>> ** Steps and implications of merging to master:
>>>>>>>>>>
>>>>>>>>>> * Master branch is merged to python-sdk branch at regular
>>>>>>>>>>
>>>>>>>>> intervals
>>>>>
>>>>>> and
>>>>>>
>>>>>>> the
>>>>>>>>>
>>>>>>>>>> last merge was on 12/22. All the past merges were uneventful
>>>>>>>>>>
>>>>>>>>> because
>>>>>
>>>>>> there
>>>>>>>>>
>>>>>>>>>> is a minimal overlap in modified files between branches.
>>>>>>>>>>
>>>>>>>>> Integrating
>>>>>
>>>>>> python-sdk to master will similarly touch a small number of
>>>>>>>>>>
>>>>>>>>> existing
>>>>>
>>>>>> files.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> * Python SDK is using the same tools for building and testing. It
>>>>>>>>>>
>>>>>>>>> is
>>>>>
>>>>>> already integrated with Maven, Jenkins and Travis. Specifically
>>>>>>>>>>
>>>>>>>>> the
>>>>>
>>>>>> impact
>>>>>>>>>
>>>>>>>>>> to the testing infrastructure would be:
>>>>>>>>>> - There will be two additional test configurations in Travis.
>>>>>>>>>>
>>>>>>>>> Since
>>>>>
>>>>>> Travis
>>>>>>>>>
>>>>>>>>>> runs all configurations in parallel there should not be a
>>>>>>>>>>
>>>>>>>>> noticeable
>>>>>
>>>>>> change
>>>>>>>>>
>>>>>>>>>> in the Travis run time.
>>>>>>>>>> - Jenkins pre-commit test will start running the Python SDK tests.
>>>>>>>>>>
>>>>>>>>> It
>>>>>>
>>>>>>> will
>>>>>>>>>
>>>>>>>>>> add an additional 5 minutes to the completion time of pre-commit
>>>>>>>>>>
>>>>>>>>> test.
>>>>>>
>>>>>>> Historically Python SDK tests were not flaky and did not cause any
>>>>>>>>>>
>>>>>>>>> random
>>>>>>>>
>>>>>>>>> failures.
>>>>>>>>>> - Jenkins Python post-commit test is already separated from the
>>>>>>>>>>
>>>>>>>>> other
>>>>>>
>>>>>>> post-commit tests and will continue to exist. It would not change
>>>>>>>>>>
>>>>>>>>> the
>>>>>>
>>>>>>> testing time for any other test.
>>>>>>>>>>
>>>>>>>>>> * The release process needs to be updated to accommodate releasing
>>>>>>>>>>
>>>>>>>>> Python
>>>>>>>>
>>>>>>>>> artifacts. Python SDK would fit in the existing release schedule
>>>>>>>>>>
>>>>>>>>> and
>>>>>
>>>>>> could
>>>>>>>>>
>>>>>>>>>> be released along with the Java SDK. The additional steps would
>>>>>>>>>>
>>>>>>>>> include:
>>>>>>>>
>>>>>>>>> - Generating Python artifacts. This could be done with a single
>>>>>>>>>>
>>>>>>>>> command
>>>>>>
>>>>>>> using Maven today.
>>>>>>>>>> - Publishing the artifacts to a central repository such as PyPI.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> I'm more than happy to help on this. We left on purpose some things
>>>>>>>>>
>>>>>>>> open
>>>>>>
>>>>>>> when we added Maven support to the Python build.
>>>>>>>>>
>>>>>>>>>
>>>>>>>> That would be awesome. We can coordinate on that post-merge.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> - Updating the release guide to reflect the changes above.
>>>>>>>>>>
>>>>>>>>>> * Users: There are existing users using the Python SDK. To give a
>>>>>>>>>>
>>>>>>>>> rough
>>>>>>
>>>>>>> estimate, a distribution of the Beam Python SDK had a total of 23K
>>>>>>>>>> downloads in the past 6 months [6]. Some of those users are
>>>>>>>>>>
>>>>>>>>> already
>>>>>
>>>>>> engaged
>>>>>>>>>
>>>>>>>>>> with the community (e.g. [7]). There might be an increased amount of
>>>>>>>>>> engagement from the rest of them after the merge.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> Python 3 support is something we definitely need to plan for.
>>>>>>>>>
>>>>>>>> I'd
>>>>>
>>>>>> try
>>>>>>
>>>>>>> to make the codebase compatible with both 2.7.x and 3.6.x, rather
>>>>>>>>>
>>>>>>>> than
>>>>>
>>>>>> using other  solutions like 2to3.
>>>>>>>>>
>>>>>>>>>
>>>>>>>> I agree with you. I think it makes more sense to make codebase
>>>>>>>>
>>>>>>> compatible
>>>>>>
>>>>>>> with both. As you mentioned Python 3 support is not a short-term goal
>>>>>>>>
>>>>>>> in
>>>>>
>>>>>> the roadmap, and we can discuss it more as we approach that.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Looking forward to hearing your thoughts and comments on
>>>>>>>>>
>>>>>>>> “graduating”
>>>>>
>>>>>> python-sdk to the master.
>>>>>>>>>>
>>>>>>>>>> Thank you,
>>>>>>>>>> Ahmet
>>>>>>>>>>
>>>>>>>>>> (*) Python SDK branch currently has a diverse group of
>>>>>>>>>>
>>>>>>>>> contributors.
>>>>>
>>>>>> Regular contributors include Charles Chen, Chamikara Jayalath,
>>>>>>>>>>
>>>>>>>>> María
>>>>>
>>>>>> García
>>>>>>>>>
>>>>>>>>>> Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam
>>>>>>>>>>
>>>>>>>>> PMC),
>>>>>
>>>>>> Sourabh Bajaj, and Vikas Kedigehalli. We have also had
>>>>>>>>>>
>>>>>>>>> contributions
>>>>>
>>>>>> from
>>>>>>>>
>>>>>>>>> Abdullah Bashir, Marco Buccini, Sergio Fernández, Seunghyun Lee,
>>>>>>>>>>
>>>>>>>>> and
>>>>>
>>>>>> Younghee Kwon.
>>>>>>>>>>
>>>>>>>>>> [1] https://github.com/apache/beam/tree/python-sdk/sdks/python
>>>>>>>>>> [2] https://beam.apache.org/documentation/programming-guide/
>>>>>>>>>> [3] https://issues.apache.org/jira/browse/BEAM-1265
>>>>>>>>>> [4]
>>>>>>>>>> https://issues.apache.org/jira/issues/?jql=status%20%3D%20Op
>>>>>>>>>> en%20AND%20labels%20%3D%20sdk-consistency
>>>>>>>>>> [5] https://issues.apache.org/jira/browse/BEAM-1218
>>>>>>>>>> [6] https://pypi.python.org/pypi/google-cloud-dataflow/json
>>>>>>>>>> [7] https://issues.apache.org/jira/browse/BEAM-1251
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Great summary, Ahmet. Thanks.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Sergio Fernández
>>>>>>>>> Partner Technology Manager
>>>>>>>>> Redlink GmbH
>>>>>>>>> m: +43 6602747925
>>>>>>>>> e: sergio.fernandez@redlink.co
>>>>>>>>> w: http://redlink.co
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>> --
>> Jean-Baptiste Onofré
>> jbonofre@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
>
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: [DISCUSS] Python SDK status and next steps

Posted by Sergio Fernández <wi...@apache.org>.
Originally we integrated the build into Maven under the default profile.
Do you feel it would be better to have it under a separate profile?

On Tue, Jan 31, 2017 at 11:07 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Just to be clear, the prerequisites for building the Python SDK are:
>
> apt-get install python-setuptools
> apt-get install python-pip
>
> They are also required by the default "regular" build.
>
> Regards
> JB
>
>
> On 01/31/2017 11:02 AM, Jean-Baptiste Onofré wrote:
>
>> Just one thing I noticed (and can be helpful for others): to build Beam
>> we now need python setuptools installed.
>>
>> For instance, on Ubuntu, you have to do:
>>
>> apt-get install python-setuptools
>>
>> Same for the pip distribution.
>>
>> I guess (if not already done), we have to update README/Building
>> instructions.
>>
>> Correct ?
>>
>> Regards
>> JB
>>
>> On 01/31/2017 08:10 AM, Ahmet Altay wrote:
>>
>>> Hi all,
>>>
>>> This merge is completed. Python SDK is now officially part of the master
>>> branch! Thank you all for the support. Please open an issue, if you
>>> notice
>>> a reference to the now obsolete python-sdk branch in the documentation.
>>>
>>> There will not be any more merges to the python-sdk branch. Going forward
>>> please use the master branch for Python SDK development. There are a few
>>> existing open PRs to the python-sdk [1]. If you are the author of one of
>>> those PRs, please rebase them on top of master.
>>>
>>> Thank you,
>>> Ahmet
>>>
>>> [1] https://github.com/pulls?utf8=✓&q=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam+
>>> <https://github.com/pulls?utf8=%E2%9C%93&q=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam+>
>>>
>>>
>>> On Fri, Jan 20, 2017 at 10:06 AM, Kenneth Knowles
>>> <kl...@google.com.invalid>
>>> wrote:
>>>
>>> To clarify the implied criteria of that last exchange, it is "An SDK
>>>> should
>>>> have at least one runner that can execute the complete model (may be a
>>>> direct runner)"
>>>>
>>>> I want to highlight this, because whether an _SDK_ supports unbounded
>>>> data
>>>> is not particularly well-defined, and will evolve:
>>>>
>>>>  - With the Runner API, an SDK will need to support building a graph
>>>> with
>>>> unbounded constructs, as today with probably minimal changes.
>>>>
>>>>  - With the Fn API, if any part of the Fn API is specific to unbounded
>>>> data, the SDK will need to implement it. I think right now there is
>>>> no such
>>>> thing, and we don't want such a thing, so SDKs implementing the Fn API
>>>> automatically support unbounded data.
>>>>
>>>>  - There will also likely be an SDK-specific shim just as there is
>>>> today,
>>>> to leverage idiomatic deserialized representations. The richness of this
>>>> shim will decrease so that it will need to "support" unbounded data but
>>>> that will be a ~one liner.
>>>>
>>>> Getting the Python SDK on master will accelerate our progress towards
>>>> the
>>>> Fn API - partly technical, partly community - which is the best path
>>>> towards support for unbounded data across multiple runners. I think the
>>>> criteria are written with the completed portability framework in
>>>> mind. So
>>>> this exchange makes me actually more convinced we should merge
>>>> python-sdk
>>>> to master.
>>>>
>>>> On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw <
>>>> robertwb@google.com.invalid> wrote:
>>>>
>>>> On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin
>>>>> <dh...@google.com.invalid> wrote:
>>>>>
>>>>>> I do not think that Python SDK yet meets the bar [1] for implementing
>>>>>>
>>>>> the
>>>>
>>>>> Beam model -- supporting Unbounded data is very important. That said,
>>>>>>
>>>>> given
>>>>>
>>>>>> the committed and sustained set of contributors, it generally makes
>>>>>>
>>>>> sense
>>>>
>>>>> to me to make an exception in anticipation of these features being
>>>>>>
>>>>> fleshed
>>>>>
>>>>>> out soon; including potentially new users/contributors that would
>>>>>>
>>>>> arrive
>>>>
>>>>> once in master.
>>>>>>
>>>>>> [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Y
>>>>>> k0PLmM3f5e5BqwJz4+C5doRucLNxOh8w@mail.gmail.com
>>>>>>
>>>>>
>>>>> That is a valid point. The Python SDK supports all the unbounded parts
>>>>> of the model except for unbounded sources, which was deferred while
>>>>> seeing how https://s.apache.org/splittable-do-fn played out. I've been
>>>>> working with the team and merging/reviewing most of their code, and
>>>>> have full confidence this will be coming (and on that note can vouch
>>>>> for a healthy community and support which are much harder to add
>>>>> later).
>>>>>
>>>>> In short, I think it has the required maturity, and I'm in favor of
>>>>> merging soonish.
>>>>>
>>>>> On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay
>>>>>> <altay@google.com.invalid
>>>>>>
>>>>>
>>>>> wrote:
>>>>>>
>>>>>> Thank you all for the comments so far. I would follow the process as
>>>>>>> suggested by Davor and others in this thread.
>>>>>>>
>>>>>>> Ahmet
>>>>>>>
>>>>>>> On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández <
>>>>>>> wikier@apache.org
>>>>>>>
>>>>>>
>>>>> wrote:
>>>>>>>
>>>>>>> Hi
>>>>>>>>
>>>>>>>> On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay
>>>>>>>>
>>>>>>> <altay@google.com.invalid
>>>>
>>>>>
>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> tl;dr: I would like to start a discussion about merging python-sdk
>>>>>>>>>
>>>>>>>> branch
>>>>>>>
>>>>>>>> to master branch. Python SDK is mature enough and merging it to
>>>>>>>>>
>>>>>>>> master
>>>>>
>>>>>> will
>>>>>>>>
>>>>>>>>> accelerate its development and adoption.
>>>>>>>>>
>>>>>>>>>
>>>>>>>> Good point, Ahmet!
>>>>>>>>
>>>>>>>> I've been following the development closely since it was imported in June.
>>>>>>>>
>>>>>>> For
>>>>>
>>>>>> the prototypes I've implemented so far it works quite well; I guess
>>>>>>>>
>>>>>>> we'd
>>>>>
>>>>>> just need to focus the next months in bringing more runners support.
>>>>>>>>
>>>>>>>> With a great effort from a lot of contributors(*), Python SDK [1] is
>>>>>>>>
>>>>>>> now
>>>>>
>>>>>> a
>>>>>>>
>>>>>>>> mostly complete, tested, performant Python implementation of the
>>>>>>>>>
>>>>>>>> Beam
>>>>>
>>>>>> model. Since June, when we first started with Python SDK in Apache
>>>>>>>>>
>>>>>>>> Beam
>>>>>
>>>>>> we
>>>>>>>>
>>>>>>>>> have been continuously improving it.
>>>>>>>>>
>>>>>>>>>
>>>>>>>> I wouldn't merge during the preparation of 0.5.0 release, but after
>>>>>>>>
>>>>>>> that
>>>>>
>>>>>> could be a good time to merge back into master.
>>>>>>>>
>>>>>>>>
>>>>>>>> ** Python SDK currently supports:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> * Model: All main concepts are present (ParDo, GroupByKey,
>>>>>>>>>
>>>>>>>> Windowing
>>>>
>>>>> etc.).
>>>>>>>>
>>>>>>>>> * IO: There are extensible APIs for writing new bounded sources
>>>>>>>>>
>>>>>>>> and
>>>>
>>>>> sinks.
>>>>>>>>
>>>>>>>>> Implementations are provided for Text, Avro, BigQuery, and
>>>>>>>>>
>>>>>>>> Datastore.
>>>>>
>>>>>> * Runners: Python SDK has an extensible base runner module that
>>>>>>>>>
>>>>>>>> allows
>>>>>
>>>>>> building specific runners on top of it. The SDK comes with two
>>>>>>>>>
>>>>>>>> pipeline
>>>>>
>>>>>> runners: DirectRunner and DataflowRunner; and it is possible to
>>>>>>>>>
>>>>>>>> add
>>>>
>>>>> more.
>>>>>>>
>>>>>>>> The existing runners are currently limited to bounded execution
>>>>>>>>>
>>>>>>>> and
>>>>
>>>>> otherwise equivalent to their Java SDK counterparts in
>>>>>>>>>
>>>>>>>> functionality.
>>>>>
>>>>>>
>>>>>>>>>
>>>>>>>> What would the effort of porting, and maintaining, parallel versions
>>>>>>>>
>>>>>>> of
>>>>>
>>>>>> the
>>>>>>>
>>>>>>>> Java runners? I guess I'd need to dig deeper in the model, but this
>>>>>>>>
>>>>>>> may
>>>>>
>>>>>> represent a major effort for the project, right?
>>>>>>>>
>>>>>>>>
>>>>>>> It is somewhat higher for DirectRunner because DirectRunner also
>>>>>>>
>>>>>> implements
>>>>>
>>>>>> the code for execution. It is not that high for DataflowRunner
>>>>>>> because
>>>>>>>
>>>>>> the
>>>>>
>>>>>> base runner module has a lot of helpers with the right hooks for
>>>>>>> implementing a generic runner. I would _expect_ the experience in
>>>>>>>
>>>>>> general
>>>>>
>>>>>> would be similar to the latter.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> * Testing: Python SDK implements ValidatesRunner test framework
>>>>>>>>>
>>>>>>>> for
>>>>
>>>>> implementing integration test for current and future runners.
>>>>>>>>>
>>>>>>>> There
>>>>
>>>>> is
>>>>>
>>>>>> unit
>>>>>>>>
>>>>>>>>> test coverage for all modules, and a number of integrations test
>>>>>>>>>
>>>>>>>> for
>>>>
>>>>> validating existing runners.
>>>>>>>>> * Documentation and examples: Documentation work has started on
>>>>>>>>>
>>>>>>>> Python
>>>>>
>>>>>> SDK.
>>>>>>>>
>>>>>>>>> Beam Programming Guide page has been updated to include Python
>>>>>>>>>
>>>>>>>> [2].
>>>>
>>>>> The
>>>>>
>>>>>> code comes with many ready to use examples and we are in a good
>>>>>>>>>
>>>>>>>> place
>>>>>
>>>>>> to
>>>>>>>
>>>>>>>> start documenting those on the website.
>>>>>>>>>
>>>>>>>>> ** We are not done yet, next on the roadmap we have:
>>>>>>>>>
>>>>>>>>> * Streaming: Both of the existing runners lack support for
>>>>>>>>>
>>>>>>>> streaming
>>>>
>>>>> execution, and currently there is work going on for adding
>>>>>>>>>
>>>>>>>> streaming
>>>>
>>>>> support to DirectRunner [3].
>>>>>>>>> * Documentation: Filling the rest of the Beam documentations with
>>>>>>>>>
>>>>>>>> Python
>>>>>>>
>>>>>>>> SDK specific information and examples.
>>>>>>>>> * SDK consistency: Making Python SDK consistent with the Java SDK.
>>>>>>>>>
>>>>>>>> We
>>>>>
>>>>>> have
>>>>>>>>
>>>>>>>>> come a long way on this and have only a few items left [4].
>>>>>>>>> * Beamifying: We have been working on removing Dataflow-specific
>>>>>>>>>
>>>>>>>> references
>>>>>>>>
>>>>>>>>> both from the documentation and from the code. There is some work
>>>>>>>>>
>>>>>>>> left,
>>>>>
>>>>>> and
>>>>>>>>
>>>>>>>>> we are currently working on those as well [5].
>>>>>>>>>
>>>>>>>>> ** Steps and implications of merging to master:
>>>>>>>>>
>>>>>>>>> * Master branch is merged to python-sdk branch at regular
>>>>>>>>>
>>>>>>>> intervals
>>>>
>>>>> and
>>>>>
>>>>>> the
>>>>>>>>
>>>>>>>>> last merge was on 12/22. All the past merges were uneventful
>>>>>>>>>
>>>>>>>> because
>>>>
>>>>> there
>>>>>>>>
>>>>>>>>> is a minimal overlap in modified files between branches.
>>>>>>>>>
>>>>>>>> Integrating
>>>>
>>>>> python-sdk to master will similarly touch a small number of
>>>>>>>>>
>>>>>>>> existing
>>>>
>>>>> files.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> * Python SDK is using the same tools for building and testing. It
>>>>>>>>>
>>>>>>>> is
>>>>
>>>>> already integrated with Maven, Jenkins and Travis. Specifically
>>>>>>>>>
>>>>>>>> the
>>>>
>>>>> impact
>>>>>>>>
>>>>>>>>> to the testing infrastructure would be:
>>>>>>>>> - There will be two additional test configurations in Travis.
>>>>>>>>>
>>>>>>>> Since
>>>>
>>>>> Travis
>>>>>>>>
>>>>>>>>> runs all configurations in parallel there should not be a
>>>>>>>>>
>>>>>>>> noticeable
>>>>
>>>>> change
>>>>>>>>
>>>>>>>>> in the Travis run time.
>>>>>>>>> - Jenkins pre-commit test will start running the Python SDK tests.
>>>>>>>>>
>>>>>>>> It
>>>>>
>>>>>> will
>>>>>>>>
>>>>>>>>> add an additional 5 minutes to the completion time of pre-commit
>>>>>>>>>
>>>>>>>> test.
>>>>>
>>>>>> Historically Python SDK tests were not flaky and did not cause any
>>>>>>>>>
>>>>>>>> random
>>>>>>>
>>>>>>>> failures.
>>>>>>>>> - Jenkins Python post-commit test is already separated from the
>>>>>>>>>
>>>>>>>> other
>>>>>
>>>>>> post-commit tests and will continue to exist. It would not change
>>>>>>>>>
>>>>>>>> the
>>>>>
>>>>>> testing time for any other test.
>>>>>>>>>
>>>>>>>>> * The release process needs to be updated to accommodate releasing
>>>>>>>>>
>>>>>>>> Python
>>>>>>>
>>>>>>>> artifacts. Python SDK would fit in the existing release schedule
>>>>>>>>>
>>>>>>>> and
>>>>
>>>>> could
>>>>>>>>
>>>>>>>>> be released along with the Java SDK. The additional steps would
>>>>>>>>>
>>>>>>>> include:
>>>>>>>
>>>>>>>> - Generating Python artifacts. This could be done with a single
>>>>>>>>>
>>>>>>>> command
>>>>>
>>>>>> using Maven today.
>>>>>>>>> - Publishing the artifacts to a central repository such as PyPI.
>>>>>>>>>
>>>>>>>>>
>>>>>>>> I'm more than happy to help on this. We left on purpose some things
>>>>>>>>
>>>>>>> open
>>>>>
>>>>>> when we added Maven support to the Python build.
>>>>>>>>
>>>>>>>>
>>>>>>> That would be awesome. We can coordinate on that post-merge.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> - Updating the release guide to reflect the changes above.
>>>>>>>>>
>>>>>>>>> * Users: There are existing users using the Python SDK. To give a
>>>>>>>>>
>>>>>>>> rough
>>>>>
>>>>>> estimate, a distribution of the Beam Python SDK had a total of 23K
>>>>>>>>> downloads in the past 6 months [6]. Some of those users are
>>>>>>>>>
>>>>>>>> already
>>>>
>>>>> engaged
>>>>>>>>
>>>>>>>>> with the community (e.g. [7]). There might be an increased amount of
>>>>>>>>> engagement from the rest of them after the merge.
>>>>>>>>>
>>>>>>>>>
>>>>>>>> Python 3 support is something we definitely need to plan for.
>>>>>>>>
>>>>>>> I'd
>>>>
>>>>> try
>>>>>
>>>>>> to make the codebase compatible with both 2.7.x and 3.6.x, rather
>>>>>>>>
>>>>>>> than
>>>>
>>>>> using other  solutions like 2to3.
>>>>>>>>
>>>>>>>>
>>>>>>> I agree with you. I think it makes more sense to make codebase
>>>>>>>
>>>>>> compatible
>>>>>
>>>>>> with both. As you mentioned Python 3 support is not a short-term goal
>>>>>>>
>>>>>> in
>>>>
>>>>> the roadmap, and we can discuss it more as we approach that.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Looking forward to hearing your thoughts and comments on
>>>>>>>>
>>>>>>> “graduating”
>>>>
>>>>> python-sdk to the master.
>>>>>>>>>
>>>>>>>>> Thank you,
>>>>>>>>> Ahmet
>>>>>>>>>
>>>>>>>>> (*) Python SDK branch currently has a diverse group of
>>>>>>>>>
>>>>>>>> contributors.
>>>>
>>>>> Regular contributors include Charles Chen, Chamikara Jayalath,
>>>>>>>>>
>>>>>>>> María
>>>>
>>>>> García
>>>>>>>>
>>>>>>>>> Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam
>>>>>>>>>
>>>>>>>> PMC),
>>>>
>>>>> Sourabh Bajaj, and Vikas Kedigehalli. We have also had
>>>>>>>>>
>>>>>>>> contributions
>>>>
>>>>> from
>>>>>>>
>>>>>>>> Abdullah Bashir, Marco Buccini, Sergio Fernández, Seunghyun Lee,
>>>>>>>>>
>>>>>>>> and
>>>>
>>>>> Younghee Kwon.
>>>>>>>>>
>>>>>>>>> [1] https://github.com/apache/beam/tree/python-sdk/sdks/python
>>>>>>>>> [2] https://beam.apache.org/documentation/programming-guide/
>>>>>>>>> [3] https://issues.apache.org/jira/browse/BEAM-1265
>>>>>>>>> [4]
>>>>>>>>> https://issues.apache.org/jira/issues/?jql=status%20%3D%20Op
>>>>>>>>> en%20AND%20labels%20%3D%20sdk-consistency
>>>>>>>>> [5] https://issues.apache.org/jira/browse/BEAM-1218
>>>>>>>>> [6] https://pypi.python.org/pypi/google-cloud-dataflow/json
>>>>>>>>> [7] https://issues.apache.org/jira/browse/BEAM-1251
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> Great summary, Ahmet. Thanks.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> --
>>>>>>>> Sergio Fernández
>>>>>>>> Partner Technology Manager
>>>>>>>> Redlink GmbH
>>>>>>>> m: +43 6602747925
>>>>>>>> e: sergio.fernandez@redlink.co
>>>>>>>> w: http://redlink.co
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>



-- 
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 6602747925
e: sergio.fernandez@redlink.co
w: http://redlink.co

Re: [DISCUSS] Python SDK status and next steps

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Just to be clear, the prerequisites for building the Python SDK are:

apt-get install python-setuptools
apt-get install python-pip

They are also required by the default "regular" build.
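
As a rough way to confirm these prerequisites before starting a build, a small check along the following lines could be used. This is only a sketch under stated assumptions: the helper name missing_build_prerequisites is hypothetical and not part of Beam, it assumes a Python 3.4+ interpreter (importlib.util.find_spec is not available on 2.7), and it only verifies that the setuptools and pip modules are importable, not that the full build will succeed.

```python
import importlib.util


def missing_build_prerequisites(modules=("setuptools", "pip")):
    """Return the names of the given modules that cannot be found.

    Hypothetical helper, not part of Beam: it only checks that each
    top-level module is visible to the current Python interpreter.
    """
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_build_prerequisites()
    if missing:
        print("Missing Python build prerequisites: " + ", ".join(missing))
    else:
        print("setuptools and pip are both available.")
```

If the check reports a missing module, installing the corresponding package (for instance with the apt-get commands above on Ubuntu) should resolve it.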

Regards
JB

On 01/31/2017 11:02 AM, Jean-Baptiste Onofré wrote:
> Just one thing I noticed (and can be helpful for others): to build Beam
> we now need python setuptools installed.
>
> For instance, on Ubuntu, you have to do:
>
> apt-get install python-setuptools
>
> Same for the pip distribution.
>
> I guess (if not already done), we have to update README/Building
> instructions.
>
> Correct ?
>
> Regards
> JB
>
> On 01/31/2017 08:10 AM, Ahmet Altay wrote:
>> Hi all,
>>
>> This merge is completed. Python SDK is now officially part of the master
>> branch! Thank you all for the support. Please open an issue, if you
>> notice
>> a reference to the now obsolete python-sdk branch in the documentation.
>>
>> There will not be any more merges to the python-sdk branch. Going forward
>> please use the master branch for Python SDK development. There are a few
>> existing open PRs to the python-sdk [1]. If you are the author of one of
>> those PRs, please rebase them on top of master.
>>
>> Thank you,
>> Ahmet
>>
>> [1] https://github.com/pulls?utf8=✓&q=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam+
>> <https://github.com/pulls?utf8=%E2%9C%93&q=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam+>
>>
>>
>> On Fri, Jan 20, 2017 at 10:06 AM, Kenneth Knowles
>> <kl...@google.com.invalid>
>> wrote:
>>
>>> To clarify the implied criteria of that last exchange, it is "An SDK
>>> should
>>> have at least one runner that can execute the complete model (may be a
>>> direct runner)"
>>>
>>> I want to highlight this, because whether an _SDK_ supports unbounded
>>> data
>>> is not particularly well-defined, and will evolve:
>>>
>>>  - With the Runner API, an SDK will need to support building a graph
>>> with
>>> unbounded constructs, as today with probably minimal changes.
>>>
>>>  - With the Fn API, if any part of the Fn API is specific to unbounded
>>> data, the SDK will need to implement it. I think right now there is
>>> no such
>>> thing, and we don't want such a thing, so SDKs implementing the Fn API
>>> automatically support unbounded data.
>>>
>>>  - There will also likely be an SDK-specific shim just as there is
>>> today,
>>> to leverage idiomatic deserialized representations. The richness of this
>>> shim will decrease so that it will need to "support" unbounded data but
>>> that will be a ~one liner.
>>>
>>> Getting the Python SDK on master will accelerate our progress towards
>>> the
>>> Fn API - partly technical, partly community - which is the best path
>>> towards support for unbounded data across multiple runners. I think the
>>> criteria are written with the completed portability framework in
>>> mind. So
>>> this exchange makes me actually more convinced we should merge
>>> python-sdk
>>> to master.
>>>
>>> On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw <
>>> robertwb@google.com.invalid> wrote:
>>>
>>>> On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin
>>>> <dh...@google.com.invalid> wrote:
>>>>> I do not think that Python SDK yet meets the bar [1] for implementing
>>> the
>>>>> Beam model -- supporting Unbounded data is very important. That said,
>>>> given
>>>>> the committed and sustained set of contributors, it generally makes
>>> sense
>>>>> to me to make an exception in anticipation of these features being
>>>> fleshed
>>>>> out soon; including potentially new users/contributors that would
>>> arrive
>>>>> once in master.
>>>>>
>>>>> [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Y
>>>>> k0PLmM3f5e5BqwJz4+C5doRucLNxOh8w@mail.gmail.com
>>>>
>>>> That is a valid point. The Python SDK supports all the unbounded parts
>>>> of the model except for unbounded sources, which was deferred while
>>>> seeing how https://s.apache.org/splittable-do-fn played out. I've been
>>>> working with the team and merging/reviewing most of their code, and
>>>> have full confidence this will be coming (and on that note can vouch
>>>> for a healthy community and support which are much harder to add
>>>> later).
>>>>
>>>> In short, I think it has the required maturity, and I'm in favor of
>>>> merging soonish.
>>>>
>>>>> On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay
>>>>> <altay@google.com.invalid
>>>>
>>>>> wrote:
>>>>>
>>>>>> Thank you all for the comments so far. I would follow the process as
>>>>>> suggested by Davor and others in this thread.
>>>>>>
>>>>>> Ahmet
>>>>>>
>>>>>> On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández <wikier@apache.org
>>>>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay
>>> <altay@google.com.invalid
>>>>>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> tl;dr: I would like to start a discussion about merging python-sdk
>>>>>> branch
>>>>>>>> to master branch. Python SDK is mature enough and merging it to
>>>> master
>>>>>>> will
>>>>>>>> accelerate its development and adoption.
>>>>>>>>
>>>>>>>
>>>>>>> Good point, Ahmet!
>>>>>>>
>>>>>>> I've been closely following the development since it was imported in June.
>>>> For
>>>>>>> the prototypes I've implemented so far it works quite well; I guess
>>>> we'd
>>>>>>> just need to focus the next months in bringing more runners support.
>>>>>>>
>>>>>>> With a great effort from a lot of contributors(*), Python SDK [1] is
>>>> now
>>>>>> a
>>>>>>>> mostly complete, tested, performant Python implementation of the
>>>> Beam
>>>>>>>> model. Since June, when we first started with Python SDK in Apache
>>>> Beam
>>>>>>> we
>>>>>>>> have been continuously improving it.
>>>>>>>>
>>>>>>>
>>>>>>> I wouldn't merge during the preparation of 0.5.0 release, but after
>>>> that
>>>>>>> could be a good time to merge back into master.
>>>>>>>
>>>>>>>
>>>>>>> ** Python SDK currently supports:
>>>>>>>>
>>>>>>>> * Model: All main concepts are present (ParDo, GroupByKey,
>>> Windowing
>>>>>>> etc.).
>>>>>>>> * IO: There are extensible APIs for writing new bounded sources
>>> and
>>>>>>> sinks.
>>>>>>>> Implementations are provided for Text, Avro, BigQuery, and
>>>> Datastore.
>>>>>>>> * Runners: Python SDK has an extensible base runner module that
>>>> allows
>>>>>>>> building specific runners on top of it. The SDK comes with two
>>>> pipeline
>>>>>>>> runners: DirectRunner and DataflowRunner; and it is possible to
>>> add
>>>>>> more.
>>>>>>>> The existing runners are currently limited to bounded execution
>>> and
>>>>>>>> otherwise equivalent to their Java SDK counterparts in
>>>> functionality.
>>>>>>>>
>>>>>>>
>>>>>>> What would the effort of porting, and maintaining, parallel versions
>>>> of
>>>>>> the
>>>>>>> Java runners? I guess I'd need to dig deeper in the model, but this
>>>> may
>>>>>>> represent a major effort for the project, right?
>>>>>>>
>>>>>>
>>>>>> It is somewhat higher for DirectRunner because DirectRunner also
>>>> implements
>>>>>> the code for execution. It is not that high for DataflowRunner
>>>>>> because
>>>> the
>>>>>> base runner module has a lot of helpers with the right hooks for
>>>>>> implementing a generic runner. I would _expect_ the experience in
>>>> general
>>>>>> would be similar to the latter.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> * Testing: Python SDK implements ValidatesRunner test framework
>>> for
>>>>>>>> implementing integration test for current and future runners.
>>> There
>>>> is
>>>>>>> unit
>>>>>>>> test coverage for all modules, and a number of integrations test
>>> for
>>>>>>>> validating existing runners.
>>>>>>>> * Documentation and examples: Documentation work has started on
>>>> Python
>>>>>>> SDK.
>>>>>>>> Beam Programming Guide page has been updated to include Python
>>> [2].
>>>> The
>>>>>>>> code comes with many ready to use examples and we are in a good
>>>> place
>>>>>> to
>>>>>>>> start documenting those on the website.
>>>>>>>>
>>>>>>>> ** We are not done yet, next on the roadmap we have:
>>>>>>>>
>>>>>>>> * Streaming: Both of the existing runners lack support for
>>> streaming
>>>>>>>> execution, and currently there is work going on for adding
>>> streaming
>>>>>>>> support to DirectRunner [3].
>>>>>>>> * Documentation: Filling the rest of the Beam documentations with
>>>>>> Python
>>>>>>>> SDK specific information and examples.
>>>>>>>> * SDK consistency: Making Python SDK consistent with the Java SDK.
>>>> We
>>>>>>> have
>>>>>>>> come a long way on this and have only a few items left [4].
>>>>>>>> * Beamifying: We have been working on removing Dataflow-specific
>>>>>>> references
>>>>>>>> both from the documentation and from the code. There is some work
>>>> left,
>>>>>>> and
>>>>>>>> we are currently working on those as well [5].
>>>>>>>>
>>>>>>>> ** Steps and implications of merging to master:
>>>>>>>>
>>>>>>>> * Master branch is merged to python-sdk branch at regular
>>> intervals
>>>> and
>>>>>>> the
>>>>>>>> last merge was on 12/22. All the past merges were uneventful
>>> because
>>>>>>> there
>>>>>>>> is a minimal overlap in modified files between branches.
>>> Integrating
>>>>>>>> python-sdk to master will similarly touch a small number of
>>> existing
>>>>>>> files.
>>>>>>>>
>>>>>>>> * Python SDK is using the same tools for building and testing. It
>>> is
>>>>>>>> already integrated with Maven, Jenkins and Travis. Specifically
>>> the
>>>>>>> impact
>>>>>>>> to the testing infrastructure would be:
>>>>>>>> - There will be two additional test configurations in Travis.
>>> Since
>>>>>>> Travis
>>>>>>>> runs all configurations in parallel there should not be a
>>> noticeable
>>>>>>> change
>>>>>>>> in the Travis run time.
>>>>>>>> - Jenkins pre-commit test will start running the Python SDK tests.
>>>> It
>>>>>>> will
>>>>>>>> add an additional 5 minutes to the completion time of pre-commit
>>>> test.
>>>>>>>> Historically Python SDK tests were not flaky and did not cause any
>>>>>> random
>>>>>>>> failures.
>>>>>>>> - Jenkins Python post-commit test is already separated from the
>>>> other
>>>>>>>> post-commit tests and will continue to exist. It would not change
>>>> the
>>>>>>>> testing time for any other test.
>>>>>>>>
>>>>>>>> * The release process needs to be updated to accommodate releasing
>>>>>> Python
>>>>>>>> artifacts. Python SDK would fit in the existing release schedule
>>> and
>>>>>>> could
>>>>>>>> be released along with the Java SDK. The additional steps would
>>>>>> include:
>>>>>>>> - Generating Python artifacts. This could be done with a single
>>>> command
>>>>>>>> using Maven today.
>>>>>>>> - Publishing the artifacts to a central repository such as PyPI.
>>>>>>>>
>>>>>>>
>>>>>>> I'm more than happy to help on this. We left on purpose some things
>>>> open
>>>>>>> when we added Maven support to the Python build.
>>>>>>>
>>>>>>
>>>>>> That would be awesome. We can coordinate on that post-merge.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> - Updating the release guide to reflect the changes above.
>>>>>>>>
>>>>>>>> * Users: There are existing users using the Python SDK. To give a
>>>> rough
>>>>>>>> estimate, a distribution of the Beam Python SDK had a total of 23K
>>>>>>>> downloads in the past 6 months [6]. Some of those users are
>>> already
>>>>>>> engaged
>>>>>>>> with the community (e.g. [7]). There might be an increased amount
>>>>>>>> of engagement from the rest of them after the merge.
>>>>>>>>
>>>>>>>
>>>>>>> Python 3 support is something we definitively need to look ahead.
>>> I'd
>>>> try
>>>>>>> to make the codebase compatible with both 2.7.x and 3.6.x, rather
>>> than
>>>>>>> using other  solutions like 2to3.
>>>>>>>
>>>>>>
>>>>>> I agree with you. I think it makes more sense to make codebase
>>>> compatible
>>>>>> with both. As you mentioned Python 3 support is not a short-term goal
>>> in
>>>>>> the roadmap, and we can discuss it more as we approach that.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Looking forward to hearing your thoughts and comments on
>>> “graduating”
>>>>>>>> python-sdk to the master.
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> Ahmet
>>>>>>>>
>>>>>>>> (*) Python SDK branch currently has a diverse group of
>>> contributors.
>>>>>>>> Regular contributors include Charles Chen, Chamikara Jayalath,
>>> María
>>>>>>> García
>>>>>>>> Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam
>>> PMC),
>>>>>>>> Sourabh Bajaj, and Vikas Kedigehalli. We have also had
>>> contributions
>>>>>> from
>>>>>>>> Abdullah Bashir, Marco Buccini, Sergio Fernández, Seunghyun Lee,
>>> and
>>>>>>>> Younghee Kwon.
>>>>>>>>
>>>>>>>> [1] https://github.com/apache/beam/tree/python-sdk/sdks/python
>>>>>>>> [2] https://beam.apache.org/documentation/programming-guide/
>>>>>>>> [3] https://issues.apache.org/jira/browse/BEAM-1265
>>>>>>>> [4]
>>>>>>>> https://issues.apache.org/jira/issues/?jql=status%20%3D%20Op
>>>>>>>> en%20AND%20labels%20%3D%20sdk-consistency
>>>>>>>> [5] https://issues.apache.org/jira/browse/BEAM-1218
>>>>>>>> [6] https://pypi.python.org/pypi/google-cloud-dataflow/json
>>>>>>>> [7] https://issues.apache.org/jira/browse/BEAM-1251
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Great summary, Ahmet. Thanks.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> --
>>>>>>> Sergio Fernández
>>>>>>> Partner Technology Manager
>>>>>>> Redlink GmbH
>>>>>>> m: +43 6602747925
>>>>>>> e: sergio.fernandez@redlink.co
>>>>>>> w: http://redlink.co
>>>>>>>
>>>>>>
>>>>
>>>
>>
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: [DISCUSS] Python SDK status and next steps

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Just one thing I noticed (which may be helpful for others): to build Beam
we now need Python setuptools installed.

For instance, on Ubuntu, you have to do:

apt-get install python-setuptools

Same for the pip distribution.
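
As a quick sanity check before building, something like the following
sketch could report whether the two Python prerequisites mentioned above
can be imported. The helper name missing_build_modules is made up for
illustration and is not part of Beam or its build; the only assumption
is that setuptools and pip are importable under those module names.

```python
# Hypothetical helper (not part of Beam): report which of the Python
# build prerequisites mentioned above cannot be imported.
def missing_build_modules(required=("setuptools", "pip")):
    """Return the module names in `required` that fail to import."""
    missing = []
    for name in required:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_build_modules()
    if missing:
        print("Missing Python build prerequisites: " + ", ".join(missing))
    else:
        print("Python build prerequisites found.")
```

If the script reports setuptools as missing, the apt-get command above
(or the equivalent for your platform) should take care of it.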

I guess (if not already done), we have to update README/Building 
instructions.

Correct?

Regards
JB

On 01/31/2017 08:10 AM, Ahmet Altay wrote:
> Hi all,
>
> This merge is completed. Python SDK is now officially part of the master
> branch! Thank you all for the support. Please open an issue, if you notice
> a reference to the now obsolete python-sdk branch in the documentation.
>
> There will not be any more merges to the python-sdk branch. Going forward
> please use the master branch for Python SDK development. There are a few
> existing open PRs to the python-sdk [1]. If you are the author of one of
> those PRs, please rebase them on top of master.
>
> Thank you,
> Ahmet
>
> [1] https://github.com/pulls?utf8=%E2%9C%93&q=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam+
>
> On Fri, Jan 20, 2017 at 10:06 AM, Kenneth Knowles <kl...@google.com.invalid>
> wrote:
>
>> To clarify the implied criteria of that last exchange, it is "An SDK should
>> have at least one runner that can execute the complete model (may be a
>> direct runner)"
>>
>> I want to highlight this, because whether an _SDK_ supports unbounded data
>> is not particularly well-defined, and will evolve:
>>
>>  - With the Runner API, an SDK will need to support building a graph with
>> unbounded constructs, as today with probably minimal changes.
>>
>>  - With the Fn API, if any part of the Fn API is specific to unbounded
>> data, the SDK will need to implement it. I think right now there is no such
>> thing, and we don't want such a thing, so SDKs implementing the Fn API
>> automatically support unbounded data.
>>
>>  - There will also likely be an SDK-specific shim just as there is today,
>> to leverage idiomatic deserialized representations. The richness of this
>> shim will decrease so that it will need to "support" unbounded data but
>> that will be a ~one liner.
>>
>> Getting the Python SDK on master will accelerate our progress towards the
>> Fn API - partly technical, partly community - which is the best path
>> towards support for unbounded data across multiple runners. I think the
>> criteria are written with the completed portability framework in mind. So
>> this exchange makes me actually more convinced we should merge python-sdk
>> to master.
>>
>> On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw <
>> robertwb@google.com.invalid> wrote:
>>
>>> On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin
>>> <dh...@google.com.invalid> wrote:
>>>> I do not think that Python SDK yet meets the bar [1] for implementing
>> the
>>>> Beam model -- supporting Unbounded data is very important. That said,
>>> given
>>>> the committed and sustained set of contributors, it generally makes
>> sense
>>>> to me to make an exception in anticipation of these features being
>>> fleshed
>>>> out soon; including potentially new users/contributors that would
>> arrive
>>>> once in master.
>>>>
>>>> [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Y
>>>> k0PLmM3f5e5BqwJz4+C5doRucLNxOh8w@mail.gmail.com
>>>
>>> That is a valid point. The Python SDK supports all the unbounded parts
>>> of the model except for unbounded sources, which was deferred while
>>> seeing how https://s.apache.org/splittable-do-fn played out. I've been
>>> working with the team and merging/reviewing most of their code, and
>>> have full confidence this will be coming (and on that note can vouch
>>> for a healthy community and support which are much harder to add
>>> later).
>>>
>>> In short, I think it has the required maturity, and I'm in favor of
>>> merging soonish.
>>>
>>>> On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay <altay@google.com.invalid
>>>
>>>> wrote:
>>>>
>>>>> Thank you all for the comments so far. I would follow the process as
>>>>> suggested by Davor and others in this thread.
>>>>>
>>>>> Ahmet
>>>>>
>>>>> On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández <wikier@apache.org
>>>
>>>>> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay
>> <altay@google.com.invalid
>>>>
>>>>>> wrote:
>>>>>>>
>>>>>>> tl;dr: I would like to start a discussion about merging python-sdk
>>>>> branch
>>>>>>> to master branch. Python SDK is mature enough and merging it to
>>> master
>>>>>> will
>>>>>>> accelerate its development and adoption.
>>>>>>>
>>>>>>
>>>>>> Good point, Ahmet!
>>>>>>
>>>>>> I've been closely following the development since it was imported in June.
>>> For
>>>>>> the prototypes I've implemented so far it works quite well; I guess
>>> we'd
>>>>>> just need to focus the next months in bringing more runners support.
>>>>>>
>>>>>> With a great effort from a lot of contributors(*), Python SDK [1] is
>>> now
>>>>> a
>>>>>>> mostly complete, tested, performant Python implementation of the
>>> Beam
>>>>>>> model. Since June, when we first started with Python SDK in Apache
>>> Beam
>>>>>> we
>>>>>>> have been continuously improving it.
>>>>>>>
>>>>>>
>>>>>> I wouldn't merge during the preparation of 0.5.0 release, but after
>>> that
>>>>>> could be a good time to merge back into master.
>>>>>>
>>>>>>
>>>>>> ** Python SDK currently supports:
>>>>>>>
>>>>>>> * Model: All main concepts are present (ParDo, GroupByKey,
>> Windowing
>>>>>> etc.).
>>>>>>> * IO: There are extensible APIs for writing new bounded sources
>> and
>>>>>> sinks.
>>>>>>> Implementations are provided for Text, Avro, BigQuery, and
>>> Datastore.
>>>>>>> * Runners: Python SDK has an extensible base runner module that
>>> allows
>>>>>>> building specific runners on top of it. The SDK comes with two
>>> pipeline
>>>>>>> runners: DirectRunner and DataflowRunner; and it is possible to
>> add
>>>>> more.
>>>>>>> The existing runners are currently limited to bounded execution
>> and
>>>>>>> otherwise equivalent to their Java SDK counterparts in
>>> functionality.
>>>>>>>
>>>>>>
>>>>>> What would the effort of porting, and maintaining, parallel versions
>>> of
>>>>> the
>>>>>> Java runners? I guess I'd need to dig deeper in the model, but this
>>> may
>>>>>> represent a major effort for the project, right?
>>>>>>
>>>>>
>>>>> It is somewhat higher for DirectRunner because DirectRunner also
>>> implements
>>>>> the code for execution. It is not that high for DataflowRunner because
>>> the
>>>>> base runner module has a lot of helpers with the right hooks for
>>>>> implementing a generic runner. I would _expect_ the experience in
>>> general
>>>>> would be similar to the latter.
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> * Testing: Python SDK implements ValidatesRunner test framework
>> for
>>>>>>> implementing integration test for current and future runners.
>> There
>>> is
>>>>>> unit
>>>>>>> test coverage for all modules, and a number of integrations test
>> for
>>>>>>> validating existing runners.
>>>>>>> * Documentation and examples: Documentation work has started on
>>> Python
>>>>>> SDK.
>>>>>>> Beam Programming Guide page has been updated to include Python
>> [2].
>>> The
>>>>>>> code comes with many ready to use examples and we are in a good
>>> place
>>>>> to
>>>>>>> start documenting those on the website.
>>>>>>>
>>>>>>> ** We are not done yet, next on the roadmap we have:
>>>>>>>
>>>>>>> * Streaming: Both of the existing runners lack support for
>> streaming
>>>>>>> execution, and currently there is work going on for adding
>> streaming
>>>>>>> support to DirectRunner [3].
>>>>>>> * Documentation: Filling the rest of the Beam documentations with
>>>>> Python
>>>>>>> SDK specific information and examples.
>>>>>>> * SDK consistency: Making Python SDK consistent with the Java SDK.
>>> We
>>>>>> have
>>>>>>> come a long way on this and have only a few items left [4].
>>>>>>> * Beamifying: We have been working on removing Dataflow-specific
>>>>>> references
>>>>>>> both from the documentation and from the code. There is some work
>>> left,
>>>>>> and
>>>>>>> we are currently working on those as well [5].
>>>>>>>
>>>>>>> ** Steps and implications of merging to master:
>>>>>>>
>>>>>>> * Master branch is merged to python-sdk branch at regular
>> intervals
>>> and
>>>>>> the
>>>>>>> last merge was on 12/22. All the past merges were uneventful
>> because
>>>>>> there
>>>>>>> is a minimal overlap in modified files between branches.
>> Integrating
>>>>>>> python-sdk to master will similarly touch a small number of
>> existing
>>>>>> files.
>>>>>>>
>>>>>>> * Python SDK is using the same tools for building and testing. It
>> is
>>>>>>> already integrated with Maven, Jenkins and Travis. Specifically
>> the
>>>>>> impact
>>>>>>> to the testing infrastructure would be:
>>>>>>> - There will be two additional test configurations in Travis.
>> Since
>>>>>> Travis
>>>>>>> runs all configurations in parallel there should not be a
>> noticeable
>>>>>> change
>>>>>>> in the Travis run time.
>>>>>>> - Jenkins pre-commit test will start running the Python SDK tests.
>>> It
>>>>>> will
>>>>>>> add an additional 5 minutes to the completion time of pre-commit
>>> test.
>>>>>>> Historically Python SDK tests were not flaky and did not cause any
>>>>> random
>>>>>>> failures.
>>>>>>> - Jenkins Python post-commit test is already separated from the
>>> other
>>>>>>> post-commit tests and will continue to exist. It would not change
>>> the
>>>>>>> testing time for any other test.
>>>>>>>
>>>>>>> * The release process needs to be updated to accommodate releasing
>>>>> Python
>>>>>>> artifacts. Python SDK would fit in the existing release schedule
>> and
>>>>>> could
>>>>>>> be released along with the Java SDK. The additional steps would
>>>>> include:
>>>>>>> - Generating Python artifacts. This could be done with a single
>>> command
>>>>>>> using Maven today.
>>>>>>> - Publishing the artifacts to a central repository such as PyPI.
>>>>>>>
>>>>>>
>>>>>> I'm more than happy to help on this. We left on purpose some things
>>> open
>>>>>> when we added Maven support to the Python build.
>>>>>>
>>>>>
>>>>> That would be awesome. We can coordinate on that post-merge.
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> - Updating the release guide to reflect the changes above.
>>>>>>>
>>>>>>> * Users: There are existing users using the Python SDK. To give a
>>> rough
>>>>>>> estimate, a distribution of the Beam Python SDK had a total of 23K
>>>>>>> downloads in the past 6 months [6]. Some of those users are
>> already
>>>>>> engaged
>>>>>>> with the community (e.g. [7]). There might be an increased amount
>>>>>>> of engagement from the rest of them after the merge.
>>>>>>>
>>>>>>
>>>>>> Python 3 support is something we definitively need to look ahead.
>> I'd
>>> try
>>>>>> to make the codebase compatible with both 2.7.x and 3.6.x, rather
>> than
>>>>>> using other  solutions like 2to3.
>>>>>>
>>>>>
>>>>> I agree with you. I think it makes more sense to make codebase
>>> compatible
>>>>> with both. As you mentioned Python 3 support is not a short-term goal
>> in
>>>>> the roadmap, and we can discuss it more as we approach that.
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> Looking forward to hearing your thoughts and comments on
>> “graduating”
>>>>>>> python-sdk to the master.
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Ahmet
>>>>>>>
>>>>>>> (*) Python SDK branch currently has a diverse group of
>> contributors.
>>>>>>> Regular contributors include Charles Chen, Chamikara Jayalath,
>> María
>>>>>> García
>>>>>>> Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam
>> PMC),
>>>>>>> Sourabh Bajaj, and Vikas Kedigehalli. We have also had
>> contributions
>>>>> from
>>>>>>> Abdullah Bashir, Marco Buccini, Sergio Fernández, Seunghyun Lee,
>> and
>>>>>>> Younghee Kwon.
>>>>>>>
>>>>>>> [1] https://github.com/apache/beam/tree/python-sdk/sdks/python
>>>>>>> [2] https://beam.apache.org/documentation/programming-guide/
>>>>>>> [3] https://issues.apache.org/jira/browse/BEAM-1265
>>>>>>> [4]
>>>>>>> https://issues.apache.org/jira/issues/?jql=status%20%3D%20Op
>>>>>>> en%20AND%20labels%20%3D%20sdk-consistency
>>>>>>> [5] https://issues.apache.org/jira/browse/BEAM-1218
>>>>>>> [6] https://pypi.python.org/pypi/google-cloud-dataflow/json
>>>>>>> [7] https://issues.apache.org/jira/browse/BEAM-1251
>>>>>>>
>>>>>>
>>>>>>
>>>>>> Great summary, Ahmet. Thanks.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> --
>>>>>> Sergio Fernández
>>>>>> Partner Technology Manager
>>>>>> Redlink GmbH
>>>>>> m: +43 6602747925
>>>>>> e: sergio.fernandez@redlink.co
>>>>>> w: http://redlink.co
>>>>>>
>>>>>
>>>
>>
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: [DISCUSS] Python SDK status and next steps

Posted by Davor Bonaci <da...@apache.org>.
Great -- congratulations to everyone who has contributed to the Python SDK!

On Mon, Jan 30, 2017 at 11:10 PM, Ahmet Altay <al...@google.com.invalid>
wrote:

> Hi all,
>
> This merge is completed. Python SDK is now officially part of the master
> branch! Thank you all for the support. Please open an issue, if you notice
> a reference to the now obsolete python-sdk branch in the documentation.
>
> There will not be any more merges to the python-sdk branch. Going forward
> please use the master branch for Python SDK development. There are a few
> existing open PRs to the python-sdk [1]. If you are the author of one of
> those PRs, please rebase them on top of master.
>
> Thank you,
> Ahmet
>
> [1] https://github.com/pulls?utf8=%E2%9C%93&q=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam+
>
> On Fri, Jan 20, 2017 at 10:06 AM, Kenneth Knowles <kl...@google.com.invalid>
> wrote:
>
> > To clarify the implied criteria of that last exchange, it is "An SDK
> should
> > have at least one runner that can execute the complete model (may be a
> > direct runner)"
> >
> > I want to highlight this, because whether an _SDK_ supports unbounded
> data
> > is not particularly well-defined, and will evolve:
> >
> >  - With the Runner API, an SDK will need to support building a graph with
> > unbounded constructs, as today with probably minimal changes.
> >
> >  - With the Fn API, if any part of the Fn API is specific to unbounded
> > data, the SDK will need to implement it. I think right now there is no
> such
> > thing, and we don't want such a thing, so SDKs implementing the Fn API
> > automatically support unbounded data.
> >
> >  - There will also likely be an SDK-specific shim just as there is today,
> > to leverage idiomatic deserialized representations. The richness of this
> > shim will decrease so that it will need to "support" unbounded data but
> > that will be a ~one liner.
> >
> > Getting the Python SDK on master will accelerate our progress towards the
> > Fn API - partly technical, partly community - which is the best path
> > towards support for unbounded data across multiple runners. I think the
> > criteria are written with the completed portability framework in mind. So
> > this exchange makes me actually more convinced we should merge python-sdk
> > to master.
> >
> > On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw <
> > robertwb@google.com.invalid> wrote:
> >
> > > On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin
> > > <dh...@google.com.invalid> wrote:
> > > > I do not think that Python SDK yet meets the bar [1] for implementing
> > the
> > > > Beam model -- supporting Unbounded data is very important. That said,
> > > given
> > > > the committed and sustained set of contributors, it generally makes
> > sense
> > > > to me to make an exception in anticipation of these features being
> > > fleshed
> > > > out soon; including potentially new users/contributors that would
> > arrive
> > > > once in master.
> > > >
> > > > [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Y
> > > > k0PLmM3f5e5BqwJz4+C5doRucLNxOh8w@mail.gmail.com
> > >
> > > That is a valid point. The Python SDK supports all the unbounded parts
> > > of the model except for unbounded sources, which was deferred while
> > > seeing how https://s.apache.org/splittable-do-fn played out. I've been
> > > working with the team and merging/reviewing most of their code, and
> > > have full confidence this will be coming (and on that note can vouch
> > > for a healthy community and support which are much harder to add
> > > later).
> > >
> > > In short, I think it has the required maturity, and I'm in favor of
> > > merging soonish.
> > >
> > > > On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay
> <altay@google.com.invalid
> > >
> > > > wrote:
> > > >
> > > >> Thank you all for the comments so far. I would follow the process as
> > > >> suggested by Davor and others in this thread.
> > > >>
> > > >> Ahmet
> > > >>
> > > >> On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández <
> wikier@apache.org
> > >
> > > >> wrote:
> > > >>
> > > >> > Hi
> > > >> >
> > > >> > On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay
> > <altay@google.com.invalid
> > > >
> > > >> > wrote:
> > > >> > >
> > > >> > > tl;dr: I would like to start a discussion about merging
> python-sdk
> > > >> branch
> > > >> > > to master branch. Python SDK is mature enough and merging it to
> > > master
> > > >> > will
> > > >> > > accelerate its development and adoption.
> > > >> > >
> > > >> >
> > > >> > Good point, Ahmet!
> > > >> >
> > > >> > I've been closely following the development since it was imported in
> June.
> > > For
> > > >> > the prototypes I've implemented so far it works quite well; I
> guess
> > > we'd
> > > >> > just need to focus the next months in bringing more runners
> support.
> > > >> >
> > > >> > With a great effort from a lot of contributors(*), Python SDK [1]
> is
> > > now
> > > >> a
> > > >> > > mostly complete, tested, performant Python implementation of the
> > > Beam
> > > >> > > model. Since June, when we first started with Python SDK in
> Apache
> > > Beam
> > > >> > we
> > > >> > > have been continuously improving it.
> > > >> > >
> > > >> >
> > > >> > I wouldn't merge during the preparation of 0.5.0 release, but
> after
> > > that
> > > >> > could be a good time to merge back into master.
> > > >> >
> > > >> >
> > > >> > ** Python SDK currently supports:
> > > >> > >
> > > >> > > * Model: All main concepts are present (ParDo, GroupByKey,
> > Windowing
> > > >> > etc.).
> > > >> > > * IO: There are extensible APIs for writing new bounded sources
> > and
> > > >> > sinks.
> > > >> > > Implementations are provided for Text, Avro, BigQuery, and
> > > Datastore.
> > > >> > > * Runners: Python SDK has an extensible base runner module that
> > > allows
> > > >> > > building specific runners on top of it. The SDK comes with two
> > > pipeline
> > > >> > > runners: DirectRunner and DataflowRunner; and it is possible to
> > add
> > > >> more.
> > > >> > > The existing runners are currently limited to bounded execution
> > and
> > > >> > > otherwise equivalent to their Java SDK counterparts in
> > > functionality.
> > > >> > >
> > > >> >
> > > >> > What would the effort of porting, and maintaining, parallel
> versions
> > > of
> > > >> the
> > > >> > Java runners? I guess I'd need to dig deeper in the model, but
> this
> > > may
> > > >> > represent a major effort for the project, right?
> > > >> >
> > > >>
> > > >> It is somewhat higher for DirectRunner because DirectRunner also
> > > implements
> > > >> the code for execution. It is not that high for DataflowRunner
> because
> > > the
> > > >> base runner module has a lot of helpers with the right hooks for
> > > >> implementing a generic runner. I would _expect_ the experience in
> > > general
> > > >> would be similar to the latter.
> > > >>
> > > >>
> > > >> >
> > > >> >
> > > >> >
> > > >> > > * Testing: Python SDK implements ValidatesRunner test framework
> > for
> > > >> > > implementing integration test for current and future runners.
> > There
> > > is
> > > >> > unit
> > > >> > > test coverage for all modules, and a number of integrations test
> > for
> > > >> > > validating existing runners.
> > > >> > > * Documentation and examples: Documentation work has started on
> > > Python
> > > >> > SDK.
> > > >> > > Beam Programming Guide page has been updated to include Python
> > [2].
> > > The
> > > >> > > code comes with many ready to use examples and we are in a good
> > > place
> > > >> to
> > > >> > > start documenting those on the website.
> > > >> > >
> > > >> > > ** We are not done yet, next on the roadmap we have:
> > > >> > >
> > > >> > > * Streaming: Both of the existing runners lack support for
> > streaming
> > > >> > > execution, and currently there is work going on for adding
> > streaming
> > > >> > > support to DirectRunner [3].
> > > >> > > * Documentation: Filling the rest of the Beam documentations
> with
> > > >> Python
> > > >> > > SDK specific information and examples.
> > > >> > > * SDK consistency: Making Python SDK consistent with the Java
> SDK.
> > > We
> > > >> > have
> > > >> > > come a long way on this and have only a few items left [4].
> > > >> > > * Beamifying: We have been working on removing Dataflow-specific
> > > >> > references
> > > >> > > both from the documentation and from the code. There is some
> work
> > > left,
> > > >> > and
> > > >> > > we are currently working on those as well [5].
> > > >> > >
> > > >> > > ** Steps and implications of merging to master:
> > > >> > >
> > > >> > > * Master branch is merged to python-sdk branch at regular
> > intervals
> > > and
> > > >> > the
> > > >> > > last merge was on 12/22. All the past merges were uneventful
> > because
> > > >> > there
> > > >> > > is a minimal overlap in modified files between branches.
> > Integrating
> > > >> > > python-sdk to master will similarly touch a small number of
> > existing
> > > >> > files.
> > > >> > >
> > > >> > > * Python SDK is using the same tools for building and testing.
> It
> > is
> > > >> > > already integrated with Maven, Jenkins and Travis. Specifically
> > the
> > > >> > impact
> > > >> > > to the testing infrastructure would be:
> > > >> > > - There will be two additional test configurations in Travis.
> > Since
> > > >> > Travis
> > > >> > > runs all configurations in parallel there should not be a
> > noticeable
> > > >> > change
> > > >> > > in the Travis run time.
> > > >> > > - Jenkins pre-commit test will start running the Python SDK
> tests.
> > > It
> > > >> > will
> > > >> > > add an additional 5 minutes to the completion time of pre-commit
> > > test.
> > > >> > > Historically Python SDK tests were not flaky and did not cause
> any
> > > >> random
> > > >> > > failures.
> > > >> > > - Jenkins Python post-commit test is already separated from the
> > > other
> > > >> > > post-commit tests and will continue to exist. It would not
> change
> > > the
> > > >> > > testing time for any other test.
> > > >> > >
> > > >> > > * The release process needs to be updated to accommodate
> releasing
> > > >> Python
> > > >> > > artifacts. Python SDK would fit in the existing release schedule
> > and
> > > >> > could
> > > >> > > be released along with the Java SDK. The additional steps would
> > > >> include:
> > > >> > > - Generating Python artifacts. This could be done with a single
> > > command
> > > >> > > using Maven today.
> > > >> > > - Publishing the artifacts to a central repository such as PyPI.
> > > >> > >
> > > >> >
> > > >> > I'm more than happy to help on this. We left on purpose some
> things
> > > open
> > > >> > when we added Maven support to the Python build.
> > > >> >
> > > >>
> > > >> That would be awesome. We can coordinate on that post-merge.
> > > >>
> > > >>
> > > >> >
> > > >> >
> > > >> >
> > > >> > > - Updating the release guide to reflect the changes above.
> > > >> > >
> > > >> > > * Users: There are existing users using the Python SDK. To give
> a
> > > rough
> > > >> > > estimate, a distribution of the Beam Python SDK had a total of
> 23K
> > > >> > > downloads in the past 6 months [6]. Some of those users are
> > already
> > > >> > engaged
> > > >> > > with the community (e.g. [7]). There might be an increased
> amount
> > > >> > > of engagement from the rest of them after the merge.
> > > >> > >
> > > >> >
> > > >> > Python 3 support is something we definitely need to look ahead to.
> > > >> > I'd try to make the codebase compatible with both 2.7.x and 3.6.x,
> > > >> > rather than using other solutions like 2to3.
> > > >> >
> > > >>
> > > >> I agree with you. I think it makes more sense to make codebase
> > > compatible
> > > >> with both. As you mentioned Python 3 support is not a short-term
> goal
> > in
> > > >> the roadmap, and we can discuss it more as we approach that.
> > > >>
> > > >>
> > > >> >
> > > >> >
> > > >> > Looking forward to hearing your thoughts and comments on
> > “graduating”
> > > >> > > python-sdk to the master.
> > > >> > >
> > > >> > > Thank you,
> > > >> > > Ahmet
> > > >> > >
> > > >> > > (*) Python SDK branch currently has a diverse group of
> > contributors.
> > > >> > > Regular contributors include Charles Chen, Chamikara Jayalath,
> > María
> > > >> > García
> > > >> > > Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam
> > PMC),
> > > >> > > Sourabh Bajaj, and Vikas Kedigehalli. We have also had
> > contributions
> > > >> from
> > > >> > > Abdullah Bashir, Marco Buccini, Sergio Fernández, Seunghyun Lee,
> > and
> > > >> > > Younghee Kwon.
> > > >> > >
> > > >> > > [1] https://github.com/apache/beam/tree/python-sdk/sdks/python
> > > >> > > [2] https://beam.apache.org/documentation/programming-guide/
> > > >> > > [3] https://issues.apache.org/jira/browse/BEAM-1265
> > > >> > > [4]
> > > >> > > https://issues.apache.org/jira/issues/?jql=status%20%3D%20Op
> > > >> > > en%20AND%20labels%20%3D%20sdk-consistency
> > > >> > > [5] https://issues.apache.org/jira/browse/BEAM-1218
> > > >> > > [6] https://pypi.python.org/pypi/google-cloud-dataflow/json
> > > >> > > [7] https://issues.apache.org/jira/browse/BEAM-1251
> > > >> > >
> > > >> >
> > > >> >
> > > >> > Great summary, Ahmet. Thanks.
> > > >> >
> > > >> > Cheers,
> > > >> >
> > > >> > --
> > > >> > Sergio Fernández
> > > >> > Partner Technology Manager
> > > >> > Redlink GmbH
> > > >> > m: +43 6602747925
> > > >> > e: sergio.fernandez@redlink.co
> > > >> > w: http://redlink.co
> > > >> >
> > > >>
> > >
> >
>

Re: [DISCUSS] Python SDK status and next steps

Posted by Ahmet Altay <al...@google.com.INVALID>.
Hi all,

The merge is complete. The Python SDK is now officially part of the master
branch! Thank you all for the support. Please open an issue if you notice
a reference to the now-obsolete python-sdk branch in the documentation.

There will not be any more merges to the python-sdk branch. Going forward
please use the master branch for Python SDK development. There are a few
existing open PRs against the python-sdk branch [1]. If you are the author of
one of those PRs, please rebase it on top of master.

Thank you,
Ahmet

[1] https://github.com/pulls?utf8=%E2%9C%93&q=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam+

On Fri, Jan 20, 2017 at 10:06 AM, Kenneth Knowles <kl...@google.com.invalid>
wrote:

> To clarify the implied criteria of that last exchange, it is "An SDK should
> have at least one runner that can execute the complete model (may be a
> direct runner)"
>
> I want to highlight this, because whether an _SDK_ supports unbounded data
> is not particularly well-defined, and will evolve:
>
>  - With the Runner API, an SDK will need to support building a graph with
> unbounded constructs, as today with probably minimal changes.
>
>  - With the Fn API, if any part of the Fn API is specific to unbounded
> data, the SDK will need to implement it. I think right now there is no such
> thing, and we don't want such a thing, so SDKs implementing the Fn API
> automatically support unbounded data.
>
>  - There will also likely be an SDK-specific shim just as there is today,
> to leverage idiomatic deserialized representations. The richness of this
> shim will decrease so that it will need to "support" unbounded data but
> that will be a ~one liner.
>
> Getting the Python SDK on master will accelerate our progress towards the
> Fn API - partly technical, partly community - which is the best path
> towards support for unbounded data across multiple runners. I think the
> criteria are written with the completed portability framework in mind. So
> this exchange makes me actually more convinced we should merge python-sdk
> to master.
>
> On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw <
> robertwb@google.com.invalid> wrote:
>
> > On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin
> > <dh...@google.com.invalid> wrote:
> > > I do not think that Python SDK yet meets the bar [1] for implementing
> the
> > > Beam model -- supporting Unbounded data is very important. That said,
> > given
> > > the committed and sustained set of contributors, it generally makes
> sense
> > > to me to make an exception in anticipation of these features being
> > fleshed
> > > out soon; including potentially new users/contributors that would
> arrive
> > > once in master.
> > >
> > > [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Y
> > > k0PLmM3f5e5BqwJz4+C5doRucLNxOh8w@mail.gmail.com
> >
> > That is a valid point. The Python SDK supports all the unbounded parts
> > of the model except for unbounded sources, which was deferred while
> > seeing how https://s.apache.org/splittable-do-fn played out. I've been
> > working with the team and merging/reviewing most of their code, and
> > have full confidence this will be coming (and on that note can vouch
> > for a healthy community and support which are much harder to add
> > later).
> >
> > In short, I think it has the required maturity, and I'm in favor of
> > merging soonish.
> >
> > > On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay <altay@google.com.invalid
> >
> > > wrote:
> > >
> > >> Thank you all for the comments so far. I would follow the process as
> > >> suggested by Davor and others in this thread.
> > >>
> > >> Ahmet
> > >>
> > >> On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández <wikier@apache.org
> >
> > >> wrote:
> > >>
> > >> > Hi
> > >> >
> > >> > On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay
> <altay@google.com.invalid
> > >
> > >> > wrote:
> > >> > >
> > >> > > tl;dr: I would like to start a discussion about merging python-sdk
> > >> branch
> > >> > > to master branch. Python SDK is mature enough and merging it to
> > master
> > >> > will
> > >> > > accelerate its development and adoption.
> > >> > >
> > >> >
> > >> > Good point, Ahmet!
> > >> >
> > >> > I've been closely following the development since it was imported in
> > >> > June. For the prototypes I've implemented so far it works quite well;
> > >> > I guess we'd just need to focus over the next months on bringing
> > >> > support for more runners.
> > >> >
> > >> > With a great effort from a lot of contributors(*), Python SDK [1] is
> > now
> > >> a
> > >> > > mostly complete, tested, performant Python implementation of the
> > Beam
> > >> > > model. Since June, when we first started with Python SDK in Apache
> > Beam
> > >> > we
> > >> > > have been continuously improving it.
> > >> > >
> > >> >
> > >> > I wouldn't merge during the preparation of 0.5.0 release, but after
> > that
> > >> > could be a good time to merge back into master.
> > >> >
> > >> >
> > >> > ** Python SDK currently supports:
> > >> > >
> > >> > > * Model: All main concepts are present (ParDo, GroupByKey,
> Windowing
> > >> > etc.).
> > >> > > * IO: There are extensible APIs for writing new bounded sources
> and
> > >> > sinks.
> > >> > > Implementations are provided for Text, Avro, BigQuery, and
> > Datastore.
> > >> > > * Runners: Python SDK has an extensible base runner module that
> > allows
> > >> > > building specific runners on top of it. The SDK comes with two
> > pipeline
> > >> > > runners: DirectRunner and DataflowRunner; and it is possible to
> add
> > >> more.
> > >> > > The existing runners are currently limited to bounded execution
> and
> > >> > > otherwise equivalent to their Java SDK counterparts in
> > functionality.
> > >> > >
> > >> >
> > >> > What would the effort of porting, and maintaining, parallel versions
> > of
> > >> the
> > >> > Java runners? I guess I'd need to dig deeper in the model, but this
> > may
> > >> > represent a major effort for the project, right?
> > >> >
> > >>
> > >> It is somewhat higher for DirectRunner because DirectRunner also
> > implements
> > >> the code for execution. It is not that high for DataflowRunner because
> > the
> > >> base runner module has a lot of helpers with the right hooks for
> > >> implementing a generic runner. I would _expect_ the experience in
> > general
> > >> would be similar to the latter.
> > >>
> > >>
> > >> >
> > >> >
> > >> >
> > >> > > * Testing: Python SDK implements ValidatesRunner test framework
> for
> > >> > > implementing integration test for current and future runners.
> There
> > is
> > >> > unit
> > >> > > test coverage for all modules, and a number of integrations test
> for
> > >> > > validating existing runners.
> > >> > > * Documentation and examples: Documentation work has started on
> > Python
> > >> > SDK.
> > >> > > Beam Programming Guide page has been updated to include Python
> [2].
> > The
> > >> > > code comes with many ready to use examples and we are in a good
> > place
> > >> to
> > >> > > start documenting those on the website.
> > >> > >
> > >> > > ** We are not done yet, next on the roadmap we have:
> > >> > >
> > >> > > * Streaming: Both of the existing runners lack support for
> streaming
> > >> > > execution, and currently there is work going on for adding
> streaming
> > >> > > support to DirectRunner [3].
> > >> > > * Documentation: Filling the rest of the Beam documentations with
> > >> Python
> > >> > > SDK specific information and examples.
> > >> > > * SDK consistency: Making Python SDK consistent with the Java SDK.
> > We
> > >> > have
> > >> > > come a long way on this and have only a few items left [4].
> > >> > > * Beamifying: We have been working on removing Dataflow-specific
> > >> > references
> > >> > > both from the documentation and from the code. There is some work
> > left,
> > >> > and
> > >> > > we are currently working on those as well [5].
> > >> > >
> > >> > > ** Steps and implications of merging to master:
> > >> > >
> > >> > > * Master branch is merged to python-sdk branch at regular
> intervals
> > and
> > >> > the
> > >> > > last merge was on 12/22. All the past merges were uneventful
> because
> > >> > there
> > >> > > is a minimal overlap in modified files between branches.
> Integrating
> > >> > > python-sdk to master will similarly touch a small number of
> existing
> > >> > files.
> > >> > >
> > >> > > * Python SDK is using the same tools for building and testing. It
> is
> > >> > > already integrated with Maven, Jenkins and Travis. Specifically
> the
> > >> > impact
> > >> > > to the testing infrastructure would be:
> > >> > > - There will be two additional test configurations in Travis.
> Since
> > >> > Travis
> > >> > > runs all configurations in parallel there should not be a
> noticeable
> > >> > change
> > >> > > in the Travis run time.
> > >> > > - Jenkins pre-commit test will start running the Python SDK tests.
> > It
> > >> > will
> > >> > > add an additional 5 minutes to the completion time of pre-commit
> > test.
> > >> > > Historically Python SDK tests were not flaky and did not cause any
> > >> random
> > >> > > failures.
> > >> > > - Jenkins Python post-commit test is already separated from the
> > other
> > >> > > post-commit tests and will continue to exist. It would not change
> > the
> > >> > > testing time for any other test.
> > >> > >
> > >> > > * The release process needs to be updated to accommodate releasing
> > >> Python
> > >> > > artifacts. Python SDK would fit in the existing release schedule
> and
> > >> > could
> > >> > > be released along with the Java SDK. The additional steps would
> > >> include:
> > >> > > - Generating Python artifacts. This could be done with a single
> > command
> > >> > > using Maven today.
> > >> > > - Publishing the artifacts to a central repository such as PyPI.
> > >> > >
> > >> >
> > >> > I'm more than happy to help on this. We left on purpose some things
> > open
> > >> > when we added Maven support to the Python build.
> > >> >
> > >>
> > >> That would be awesome. We can coordinate on that post-merge.
> > >>
> > >>
> > >> >
> > >> >
> > >> >
> > >> > > - Updating the release guide to reflect the changes above.
> > >> > >
> > >> > > * Users: There are existing users using the Python SDK. To give a
> > rough
> > >> > > estimate, a distribution of the Beam Python SDK had a total of 23K
> > >> > > downloads in the past 6 months [6]. Some of those users are
> already
> > >> > engaged
> > >> > > with the community (e.g. [7]). There might be an increased amount
> > >> > > of engagement from the rest of them after the merge.
> > >> > >
> > >> >
> > >> > Python 3 support is something we definitely need to look ahead to.
> > >> > I'd try to make the codebase compatible with both 2.7.x and 3.6.x,
> > >> > rather than using other solutions like 2to3.
> > >> >
> > >>
> > >> I agree with you. I think it makes more sense to make codebase
> > compatible
> > >> with both. As you mentioned Python 3 support is not a short-term goal
> in
> > >> the roadmap, and we can discuss it more as we approach that.
> > >>
> > >>
> > >> >
> > >> >
> > >> > Looking forward to hearing your thoughts and comments on
> “graduating”
> > >> > > python-sdk to the master.
> > >> > >
> > >> > > Thank you,
> > >> > > Ahmet
> > >> > >
> > >> > > (*) Python SDK branch currently has a diverse group of
> contributors.
> > >> > > Regular contributors include Charles Chen, Chamikara Jayalath,
> María
> > >> > García
> > >> > > Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam
> PMC),
> > >> > > Sourabh Bajaj, and Vikas Kedigehalli. We have also had
> contributions
> > >> from
> > >> > > Abdullah Bashir, Marco Buccini, Sergio Fernández, Seunghyun Lee,
> and
> > >> > > Younghee Kwon.
> > >> > >
> > >> > > [1] https://github.com/apache/beam/tree/python-sdk/sdks/python
> > >> > > [2] https://beam.apache.org/documentation/programming-guide/
> > >> > > [3] https://issues.apache.org/jira/browse/BEAM-1265
> > >> > > [4]
> > >> > > https://issues.apache.org/jira/issues/?jql=status%20%3D%20Op
> > >> > > en%20AND%20labels%20%3D%20sdk-consistency
> > >> > > [5] https://issues.apache.org/jira/browse/BEAM-1218
> > >> > > [6] https://pypi.python.org/pypi/google-cloud-dataflow/json
> > >> > > [7] https://issues.apache.org/jira/browse/BEAM-1251
> > >> > >
> > >> >
> > >> >
> > >> > Great summary, Ahmet. Thanks.
> > >> >
> > >> > Cheers,
> > >> >
> > >> > --
> > >> > Sergio Fernández
> > >> > Partner Technology Manager
> > >> > Redlink GmbH
> > >> > m: +43 6602747925
> > >> > e: sergio.fernandez@redlink.co
> > >> > w: http://redlink.co
> > >> >
> > >>
> >
>

Re: [DISCUSS] Python SDK status and next steps

Posted by Kenneth Knowles <kl...@google.com.INVALID>.
To clarify, the criterion implied by that last exchange is: "An SDK should
have at least one runner that can execute the complete model (may be a
direct runner)."

I want to highlight this, because whether an _SDK_ supports unbounded data
is not particularly well-defined, and will evolve:

 - With the Runner API, an SDK will need to support building a graph with
unbounded constructs, as today with probably minimal changes.

 - With the Fn API, if any part of the Fn API is specific to unbounded
data, the SDK will need to implement it. I think right now there is no such
thing, and we don't want such a thing, so SDKs implementing the Fn API
automatically support unbounded data.

 - There will also likely be an SDK-specific shim just as there is today,
to leverage idiomatic deserialized representations. The richness of this
shim will decrease, so although it will need to "support" unbounded data,
that will be roughly a one-liner.

Getting the Python SDK on master will accelerate our progress towards the
Fn API - partly technical, partly community - which is the best path
towards support for unbounded data across multiple runners. I think the
criteria are written with the completed portability framework in mind. So
this exchange makes me actually more convinced we should merge python-sdk
to master.

On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw <
robertwb@google.com.invalid> wrote:

> On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin
> <dh...@google.com.invalid> wrote:
> > I do not think that Python SDK yet meets the bar [1] for implementing the
> > Beam model -- supporting Unbounded data is very important. That said,
> given
> > the committed and sustained set of contributors, it generally makes sense
> > to me to make an exception in anticipation of these features being
> fleshed
> > out soon; including potentially new users/contributors that would arrive
> > once in master.
> >
> > [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Y
> > k0PLmM3f5e5BqwJz4+C5doRucLNxOh8w@mail.gmail.com
>
> That is a valid point. The Python SDK supports all the unbounded parts
> of the model except for unbounded sources, which was deferred while
> seeing how https://s.apache.org/splittable-do-fn played out. I've been
> working with the team and merging/reviewing most of their code, and
> have full confidence this will be coming (and on that note can vouch
> for a healthy community and support which are much harder to add
> later).
>
> In short, I think it has the required maturity, and I'm in favor of
> merging soonish.
>
> > On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay <al...@google.com.invalid>
> > wrote:
> >
> >> Thank you all for the comments so far. I would follow the process as
> >> suggested by Davor and others in this thread.
> >>
> >> Ahmet
> >>
> >> On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández <wi...@apache.org>
> >> wrote:
> >>
> >> > Hi
> >> >
> >> > On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay <altay@google.com.invalid
> >
> >> > wrote:
> >> > >
> >> > > tl;dr: I would like to start a discussion about merging python-sdk
> >> branch
> >> > > to master branch. Python SDK is mature enough and merging it to
> master
> >> > will
> >> > > accelerate its development and adoption.
> >> > >
> >> >
> >> > Good point, Ahmet!
> >> >
> >> > I've been closely following the development since it was imported in
> >> > June. For the prototypes I've implemented so far it works quite well;
> >> > I guess we'd just need to focus over the next months on bringing
> >> > support for more runners.
> >> >
> >> > With a great effort from a lot of contributors(*), Python SDK [1] is
> now
> >> a
> >> > > mostly complete, tested, performant Python implementation of the
> Beam
> >> > > model. Since June, when we first started with Python SDK in Apache
> Beam
> >> > we
> >> > > have been continuously improving it.
> >> > >
> >> >
> >> > I wouldn't merge during the preparation of 0.5.0 release, but after
> that
> >> > could be a good time to merge back into master.
> >> >
> >> >
> >> > ** Python SDK currently supports:
> >> > >
> >> > > * Model: All main concepts are present (ParDo, GroupByKey, Windowing
> >> > etc.).
> >> > > * IO: There are extensible APIs for writing new bounded sources and
> >> > sinks.
> >> > > Implementations are provided for Text, Avro, BigQuery, and
> Datastore.
> >> > > * Runners: Python SDK has an extensible base runner module that
> allows
> >> > > building specific runners on top of it. The SDK comes with two
> pipeline
> >> > > runners: DirectRunner and DataflowRunner; and it is possible to add
> >> more.
> >> > > The existing runners are currently limited to bounded execution and
> >> > > otherwise equivalent to their Java SDK counterparts in
> functionality.
> >> > >
> >> >
> >> > What would the effort of porting, and maintaining, parallel versions
> of
> >> the
> >> > Java runners? I guess I'd need to dig deeper in the model, but this
> may
> >> > represent a major effort for the project, right?
> >> >
> >>
> >> It is somewhat higher for DirectRunner because DirectRunner also
> implements
> >> the code for execution. It is not that high for DataflowRunner because
> the
> >> base runner module has a lot of helpers with the right hooks for
> >> implementing a generic runner. I would _expect_ the experience in
> general
> >> would be similar to the latter.
> >>
> >>
> >> >
> >> >
> >> >
> >> > > * Testing: Python SDK implements ValidatesRunner test framework for
> >> > > implementing integration test for current and future runners. There
> is
> >> > unit
> >> > > test coverage for all modules, and a number of integrations test for
> >> > > validating existing runners.
> >> > > * Documentation and examples: Documentation work has started on
> Python
> >> > SDK.
> >> > > Beam Programming Guide page has been updated to include Python [2].
> The
> >> > > code comes with many ready to use examples and we are in a good
> place
> >> to
> >> > > start documenting those on the website.
> >> > >
> >> > > ** We are not done yet, next on the roadmap we have:
> >> > >
> >> > > * Streaming: Both of the existing runners lack support for streaming
> >> > > execution, and currently there is work going on for adding streaming
> >> > > support to DirectRunner [3].
> >> > > * Documentation: Filling the rest of the Beam documentations with
> >> Python
> >> > > SDK specific information and examples.
> >> > > * SDK consistency: Making Python SDK consistent with the Java SDK.
> We
> >> > have
> >> > > come a long way on this and have only a few items left [4].
> >> > > * Beamifying: We have been working on removing Dataflow-specific
> >> > references
> >> > > both from the documentation and from the code. There is some work
> left,
> >> > and
> >> > > we are currently working on those as well [5].
> >> > >
> >> > > ** Steps and implications of merging to master:
> >> > >
> >> > > * Master branch is merged to python-sdk branch at regular intervals
> and
> >> > the
> >> > > last merge was on 12/22. All the past merges were uneventful because
> >> > there
> >> > > is a minimal overlap in modified files between branches. Integrating
> >> > > python-sdk to master will similarly touch a small number of existing
> >> > files.
> >> > >
> >> > > * Python SDK is using the same tools for building and testing. It is
> >> > > already integrated with Maven, Jenkins and Travis. Specifically the
> >> > impact
> >> > > to the testing infrastructure would be:
> >> > > - There will be two additional test configurations in Travis. Since
> >> > Travis
> >> > > runs all configurations in parallel there should not be a noticeable
> >> > change
> >> > > in the Travis run time.
> >> > > - Jenkins pre-commit test will start running the Python SDK tests.
> It
> >> > will
> >> > > add an additional 5 minutes to the completion time of pre-commit
> test.
> >> > > Historically Python SDK tests were not flaky and did not cause any
> >> random
> >> > > failures.
> >> > > - Jenkins Python post-commit test is already separated from the
> other
> >> > > post-commit tests and will continue to exist. It would not change
> the
> >> > > testing time for any other test.
> >> > >
> >> > > * The release process needs to be updated to accommodate releasing
> >> Python
> >> > > artifacts. Python SDK would fit in the existing release schedule and
> >> > could
> >> > > be released along with the Java SDK. The additional steps would
> >> include:
> >> > > - Generating Python artifacts. This could be done with a single
> command
> >> > > using Maven today.
> >> > > - Publishing the artifacts to a central repository such as PyPI.
> >> > >
> >> >
> >> > I'm more than happy to help on this. We left on purpose some things
> open
> >> > when we added Maven support to the Python build.
> >> >
> >>
> >> That would be awesome. We can coordinate on that post-merge.
> >>
> >>
> >> >
> >> >
> >> >
> >> > > - Updating the release guide to reflect the changes above.
> >> > >
> >> > > * Users: There are existing users using the Python SDK. To give a
> rough
> >> > > estimate, a distribution of the Beam Python SDK had a total of 23K
> >> > > downloads in the past 6 months [6]. Some of those users are already
> >> > engaged
> >> > > with the community (e.g. [7]). There might be an increased amount
> >> > > of engagement from the rest of them after the merge.
> >> > >
> >> >
> >> > Python 3 support is something we definitely need to look ahead to. I'd
> >> > try to make the codebase compatible with both 2.7.x and 3.6.x, rather
> >> > than using other solutions like 2to3.
> >> >
> >>
> >> I agree with you. I think it makes more sense to make codebase
> compatible
> >> with both. As you mentioned Python 3 support is not a short-term goal in
> >> the roadmap, and we can discuss it more as we approach that.
> >>
> >>
> >> >
> >> >
> >> > Looking forward to hearing your thoughts and comments on “graduating”
> >> > > python-sdk to the master.
> >> > >
> >> > > Thank you,
> >> > > Ahmet
> >> > >
> >> > > (*) Python SDK branch currently has a diverse group of contributors.
> >> > > Regular contributors include Charles Chen, Chamikara Jayalath, María
> >> > García
> >> > > Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam PMC),
> >> > > Sourabh Bajaj, and Vikas Kedigehalli. We have also had contributions
> >> from
> >> > > Abdullah Bashir, Marco Buccini, Sergio Fernández, Seunghyun Lee, and
> >> > > Younghee Kwon.
> >> > >
> >> > > [1] https://github.com/apache/beam/tree/python-sdk/sdks/python
> >> > > [2] https://beam.apache.org/documentation/programming-guide/
> >> > > [3] https://issues.apache.org/jira/browse/BEAM-1265
> >> > > [4]
> >> > > https://issues.apache.org/jira/issues/?jql=status%20%3D%20Op
> >> > > en%20AND%20labels%20%3D%20sdk-consistency
> >> > > [5] https://issues.apache.org/jira/browse/BEAM-1218
> >> > > [6] https://pypi.python.org/pypi/google-cloud-dataflow/json
> >> > > [7] https://issues.apache.org/jira/browse/BEAM-1251
> >> > >
> >> >
> >> >
> >> > Great summary, Ahmet. Thanks.
> >> >
> >> > Cheers,
> >> >
> >> > --
> >> > Sergio Fernández
> >> > Partner Technology Manager
> >> > Redlink GmbH
> >> > m: +43 6602747925
> >> > e: sergio.fernandez@redlink.co
> >> > w: http://redlink.co
> >> >
> >>
>

Re: [DISCUSS] Python SDK status and next steps

Posted by Robert Bradshaw <ro...@google.com.INVALID>.
On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin
<dh...@google.com.invalid> wrote:
> I do not think that Python SDK yet meets the bar [1] for implementing the
> Beam model -- supporting Unbounded data is very important. That said, given
> the committed and sustained set of contributors, it generally makes sense
> to me to make an exception in anticipation of these features being fleshed
> out soon; including potentially new users/contributors that would arrive
> once in master.
>
> [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Y
> k0PLmM3f5e5BqwJz4+C5doRucLNxOh8w@mail.gmail.com

That is a valid point. The Python SDK supports all the unbounded parts
of the model except for unbounded sources, which were deferred while
seeing how https://s.apache.org/splittable-do-fn played out. I've been
working with the team and merging/reviewing most of their code, and
have full confidence this will be coming (and on that note can vouch
for a healthy community and support which are much harder to add
later).

In short, I think it has the required maturity, and I'm in favor of
merging soonish.

> On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay <al...@google.com.invalid>
> wrote:
>
>> Thank you all for the comments so far. I would follow the process as
>> suggested by Davor and others in this thread.
>>
>> Ahmet
>>
>> On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández <wi...@apache.org>
>> wrote:
>>
>> > Hi
>> >
>> > On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay <al...@google.com.invalid>
>> > wrote:
>> > >
>> > > tl;dr: I would like to start a discussion about merging python-sdk
>> branch
>> > > to master branch. Python SDK is mature enough and merging it to master
>> > will
>> > > accelerate its development and adoption.
>> > >
>> >
>> > Good point, Ahmet!
>> >
>> > I've been following the development closely since it was imported in June.
>> > For the prototypes I've implemented so far it works quite well; I guess we'd
>> > just need to focus the next months on bringing support for more runners.
>> >
>> > With a great effort from a lot of contributors(*), Python SDK [1] is now
>> a
>> > > mostly complete, tested, performant Python implementation of the Beam
>> > > model. Since June, when we first started with Python SDK in Apache Beam
>> > we
>> > > have been continuously improving it.
>> > >
>> >
>> > I wouldn't merge during the preparation of 0.5.0 release, but after that
>> > could be a good time to merge back into master.
>> >
>> >
>> > ** Python SDK currently supports:
>> > >
>> > > * Model: All main concepts are present (ParDo, GroupByKey, Windowing
>> > etc.).
>> > > * IO: There are extensible APIs for writing new bounded sources and
>> > sinks.
>> > > Implementations are provided for Text, Avro, BigQuery, and Datastore.
>> > > * Runners: Python SDK has an extensible base runner module that allows
>> > > building specific runners on top of it. The SDK comes with two pipeline
>> > > runners: DirectRunner and DataflowRunner; and it is possible to add
>> more.
>> > > The existing runners are currently limited to bounded execution and
>> > > otherwise equivalent to their Java SDK counterparts in functionality.
>> > >
>> >
>> > What would be the effort of porting, and maintaining, parallel versions
>> > of the Java runners? I guess I'd need to dig deeper into the model, but
>> > this may represent a major effort for the project, right?
>> >
>>
>> It is somewhat higher for DirectRunner because DirectRunner also implements
>> the code for execution. It is not that high for DataflowRunner because the
>> base runner module has a lot of helpers with the right hooks for
>> implementing a generic runner. I would _expect_ the experience in general
>> would be similar to the latter.
>>
>>
>> >
>> >
>> >
>> > > * Testing: Python SDK implements ValidatesRunner test framework for
>> > > implementing integration test for current and future runners. There is
>> > unit
>> > > test coverage for all modules, and a number of integrations test for
>> > > validating existing runners.
>> > > * Documentation and examples: Documentation work has started on Python
>> > SDK.
>> > > Beam Programming Guide page has been updated to include Python [2]. The
>> > > code comes with many ready to use examples and we are in a good place
>> to
>> > > start documenting those on the website.
>> > >
>> > > ** We are not done yet, next on the roadmap we have:
>> > >
>> > > * Streaming: Both of the existing runners lack support for streaming
>> > > execution, and currently there is work going on for adding streaming
>> > > support to DirectRunner [3].
>> > > * Documentation: Filling the rest of the Beam documentations with
>> Python
>> > > SDK specific information and examples.
>> > > * SDK consistency: Making Python SDK consistent with the Java SDK. We
>> > have
>> > > come a long way on this and have only a few items left [4].
>> > > * Beamifying: We have been working on removing Dataflow-specific
>> > references
>> > > both from the documentation and from the code. There is some work left,
>> > and
>> > > we are currently working on those as well [5].
>> > >
>> > > ** Steps and implications of merging to master:
>> > >
>> > > * Master branch is merged to python-sdk branch at regular intervals and
>> > the
>> > > last merge was on 12/22. All the past merges were uneventful because
>> > there
>> > > is a minimal overlap in modified files between branches. Integrating
>> > > python-sdk to master will similarly touch a small number of existing
>> > files.
>> > >
>> > > * Python SDK is using the same tools for building and testing. It is
>> > > already integrated with Maven, Jenkins and Travis. Specifically the
>> > impact
>> > > to the testing infrastructure would be:
>> > > - There will be two additional test configurations in Travis. Since
>> > Travis
>> > > runs all configurations in parallel there should not be a noticeable
>> > change
>> > > in the Travis run time.
>> > > - Jenkins pre-commit test will start running the Python SDK tests. It
>> > will
>> > > add an additional 5 minutes to the completion time of pre-commit test.
>> > > Historically Python SDK tests were not flaky and did not cause any
>> random
>> > > failures.
>> > > - Jenkins Python post-commit test is already separated from the other
>> > > post-commit tests and will continue to exist. It would not change the
>> > > testing time for any other test.
>> > >
>> > > * The release process needs to be updated to accommodate releasing
>> Python
>> > > artifacts. Python SDK would fit in the existing release schedule and
>> > could
>> > > be released along with the Java SDK. The additional steps would
>> include:
>> > > - Generating Python artifacts. This could be done with a single command
>> > > using Maven today.
>> > > - Publishing the artifacts to a central repository such as PyPI.
>> > >
>> >
>> > I'm more than happy to help with this. We purposely left some things
>> > open when we added Maven support to the Python build.
>> >
>>
>> That would be awesome. We can coordinate on that post-merge.
>>
>>
>> >
>> >
>> >
>> > > - Updating the release guide to reflect the changes above.
>> > >
>> > > * Users: There are existing users using the Python SDK. To give a rough
>> > > estimate, a distribution of the Beam Python SDK had a total of 23K
>> > > downloads in the past 6 months [6]. Some of those users are already
>> > engaged
>> > > with the community (e.g. [7]). There might be an increased amount
>> > > engagement from the rest of them after the merge.
>> > >
>> >
>> > Python 3 support is something we definitely need to plan ahead for. I'd
>> > try to make the codebase compatible with both 2.7.x and 3.6.x, rather
>> > than using other solutions like 2to3.
>> >
>>
>> I agree with you. I think it makes more sense to make codebase compatible
>> with both. As you mentioned Python 3 support is not a short-term goal in
>> the roadmap, and we can discuss it more as we approach that.
>>
>>
>> >
>> >
>> > Looking forward to hearing your thoughts and comments on “graduating”
>> > > python-sdk to the master.
>> > >
>> > > Thank you,
>> > > Ahmet
>> > >
>> > > (*) Python SDK branch currently has a diverse group of contributors.
>> > > Regular contributors include Charles Chen, Chamikara Jayalath, María
>> > García
>> > > Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam PMC),
>> > > Sourabh Bajaj, and Vikas Kedigehalli. We have also had contributions
>> from
>> > > Abdullah Bashir, Marco Buccini, Sergio Fernández, Seunghyun Lee, and
>> > > Younghee Kwon.
>> > >
>> > > [1] https://github.com/apache/beam/tree/python-sdk/sdks/python
>> > > [2] https://beam.apache.org/documentation/programming-guide/
>> > > [3] https://issues.apache.org/jira/browse/BEAM-1265
>> > > [4] https://issues.apache.org/jira/issues/?jql=status%20%3D%20Open%20AND%20labels%20%3D%20sdk-consistency
>> > > [5] https://issues.apache.org/jira/browse/BEAM-1218
>> > > [6] https://pypi.python.org/pypi/google-cloud-dataflow/json
>> > > [7] https://issues.apache.org/jira/browse/BEAM-1251
>> > >
>> >
>> >
>> > Great summary, Ahmet. Thanks.
>> >
>> > Cheers,
>> >
>> > --
>> > Sergio Fernández
>> > Partner Technology Manager
>> > Redlink GmbH
>> > m: +43 6602747925
>> > e: sergio.fernandez@redlink.co
>> > w: http://redlink.co
>> >
>>

Re: [DISCUSS] Python SDK status and next steps

Posted by Dan Halperin <dh...@google.com.INVALID>.
I do not think that Python SDK yet meets the bar [1] for implementing the
Beam model -- supporting Unbounded data is very important. That said, given
the committed and sustained set of contributors, it generally makes sense
to me to make an exception in anticipation of these features being fleshed
out soon; including potentially new users/contributors that would arrive
once in master.

[1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Yk0PLmM3f5e5BqwJz4+C5doRucLNxOh8w@mail.gmail.com

On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay <al...@google.com.invalid>
wrote:

> Thank you all for the comments so far. I would follow the process as
> suggested by Davor and others in this thread.
>
> Ahmet
>
> On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández <wi...@apache.org>
> wrote:
>
> > Hi
> >
> > On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay <al...@google.com.invalid>
> > wrote:
> > >
> > > tl;dr: I would like to start a discussion about merging python-sdk
> branch
> > > to master branch. Python SDK is mature enough and merging it to master
> > will
> > > accelerate its development and adoption.
> > >
> >
> > Good point, Ahmet!
> >
> > I've been following the development closely since it was imported in June.
> > For the prototypes I've implemented so far it works quite well; I guess we'd
> > just need to focus the next months on bringing support for more runners.
> >
> > With a great effort from a lot of contributors(*), Python SDK [1] is now
> a
> > > mostly complete, tested, performant Python implementation of the Beam
> > > model. Since June, when we first started with Python SDK in Apache Beam
> > we
> > > have been continuously improving it.
> > >
> >
> > I wouldn't merge during the preparation of 0.5.0 release, but after that
> > could be a good time to merge back into master.
> >
> >
> > ** Python SDK currently supports:
> > >
> > > * Model: All main concepts are present (ParDo, GroupByKey, Windowing
> > etc.).
> > > * IO: There are extensible APIs for writing new bounded sources and
> > sinks.
> > > Implementations are provided for Text, Avro, BigQuery, and Datastore.
> > > * Runners: Python SDK has an extensible base runner module that allows
> > > building specific runners on top of it. The SDK comes with two pipeline
> > > runners: DirectRunner and DataflowRunner; and it is possible to add
> more.
> > > The existing runners are currently limited to bounded execution and
> > > otherwise equivalent to their Java SDK counterparts in functionality.
> > >
> >
> > What would be the effort of porting, and maintaining, parallel versions
> > of the Java runners? I guess I'd need to dig deeper into the model, but
> > this may represent a major effort for the project, right?
> >
>
> It is somewhat higher for DirectRunner because DirectRunner also implements
> the code for execution. It is not that high for DataflowRunner because the
> base runner module has a lot of helpers with the right hooks for
> implementing a generic runner. I would _expect_ the experience in general
> would be similar to the latter.
>
>
> >
> >
> >
> > > * Testing: Python SDK implements ValidatesRunner test framework for
> > > implementing integration test for current and future runners. There is
> > unit
> > > test coverage for all modules, and a number of integrations test for
> > > validating existing runners.
> > > * Documentation and examples: Documentation work has started on Python
> > SDK.
> > > Beam Programming Guide page has been updated to include Python [2]. The
> > > code comes with many ready to use examples and we are in a good place
> to
> > > start documenting those on the website.
> > >
> > > ** We are not done yet, next on the roadmap we have:
> > >
> > > * Streaming: Both of the existing runners lack support for streaming
> > > execution, and currently there is work going on for adding streaming
> > > support to DirectRunner [3].
> > > * Documentation: Filling the rest of the Beam documentations with
> Python
> > > SDK specific information and examples.
> > > * SDK consistency: Making Python SDK consistent with the Java SDK. We
> > have
> > > come a long way on this and have only a few items left [4].
> > > * Beamifying: We have been working on removing Dataflow-specific
> > references
> > > both from the documentation and from the code. There is some work left,
> > and
> > > we are currently working on those as well [5].
> > >
> > > ** Steps and implications of merging to master:
> > >
> > > * Master branch is merged to python-sdk branch at regular intervals and
> > the
> > > last merge was on 12/22. All the past merges were uneventful because
> > there
> > > is a minimal overlap in modified files between branches. Integrating
> > > python-sdk to master will similarly touch a small number of existing
> > files.
> > >
> > > * Python SDK is using the same tools for building and testing. It is
> > > already integrated with Maven, Jenkins and Travis. Specifically the
> > impact
> > > to the testing infrastructure would be:
> > > - There will be two additional test configurations in Travis. Since
> > Travis
> > > runs all configurations in parallel there should not be a noticeable
> > change
> > > in the Travis run time.
> > > - Jenkins pre-commit test will start running the Python SDK tests. It
> > will
> > > add an additional 5 minutes to the completion time of pre-commit test.
> > > Historically Python SDK tests were not flaky and did not cause any
> random
> > > failures.
> > > - Jenkins Python post-commit test is already separated from the other
> > > post-commit tests and will continue to exist. It would not change the
> > > testing time for any other test.
> > >
> > > * The release process needs to be updated to accommodate releasing
> Python
> > > artifacts. Python SDK would fit in the existing release schedule and
> > could
> > > be released along with the Java SDK. The additional steps would
> include:
> > > - Generating Python artifacts. This could be done with a single command
> > > using Maven today.
> > > - Publishing the artifacts to a central repository such as PyPI.
> > >
> >
> > I'm more than happy to help with this. We purposely left some things
> > open when we added Maven support to the Python build.
> >
>
> That would be awesome. We can coordinate on that post-merge.
>
>
> >
> >
> >
> > > - Updating the release guide to reflect the changes above.
> > >
> > > * Users: There are existing users using the Python SDK. To give a rough
> > > estimate, a distribution of the Beam Python SDK had a total of 23K
> > > downloads in the past 6 months [6]. Some of those users are already
> > engaged
> > > with the community (e.g. [7]). There might be an increased amount
> > > engagement from the rest of them after the merge.
> > >
> >
> > Python 3 support is something we definitely need to plan ahead for. I'd
> > try to make the codebase compatible with both 2.7.x and 3.6.x, rather
> > than using other solutions like 2to3.
> >
>
> I agree with you. I think it makes more sense to make codebase compatible
> with both. As you mentioned Python 3 support is not a short-term goal in
> the roadmap, and we can discuss it more as we approach that.
>
>
> >
> >
> > Looking forward to hearing your thoughts and comments on “graduating”
> > > python-sdk to the master.
> > >
> > > Thank you,
> > > Ahmet
> > >
> > > (*) Python SDK branch currently has a diverse group of contributors.
> > > Regular contributors include Charles Chen, Chamikara Jayalath, María
> > García
> > > Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam PMC),
> > > Sourabh Bajaj, and Vikas Kedigehalli. We have also had contributions
> from
> > > Abdullah Bashir, Marco Buccini, Sergio Fernández, Seunghyun Lee, and
> > > Younghee Kwon.
> > >
> > > [1] https://github.com/apache/beam/tree/python-sdk/sdks/python
> > > [2] https://beam.apache.org/documentation/programming-guide/
> > > [3] https://issues.apache.org/jira/browse/BEAM-1265
> > > [4] https://issues.apache.org/jira/issues/?jql=status%20%3D%20Open%20AND%20labels%20%3D%20sdk-consistency
> > > [5] https://issues.apache.org/jira/browse/BEAM-1218
> > > [6] https://pypi.python.org/pypi/google-cloud-dataflow/json
> > > [7] https://issues.apache.org/jira/browse/BEAM-1251
> > >
> >
> >
> > Great summary, Ahmet. Thanks.
> >
> > Cheers,
> >
> > --
> > Sergio Fernández
> > Partner Technology Manager
> > Redlink GmbH
> > m: +43 6602747925
> > e: sergio.fernandez@redlink.co
> > w: http://redlink.co
> >
>

Re: [DISCUSS] Python SDK status and next steps

Posted by Ahmet Altay <al...@google.com.INVALID>.
Thank you all for the comments so far. I will follow the process
suggested by Davor and others in this thread.

Ahmet

On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández <wi...@apache.org>
wrote:

> Hi
>
> On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay <al...@google.com.invalid>
> wrote:
> >
> > tl;dr: I would like to start a discussion about merging python-sdk branch
> > to master branch. Python SDK is mature enough and merging it to master
> will
> > accelerate its development and adoption.
> >
>
> Good point, Ahmet!
>
> I've been following the development closely since it was imported in June.
> For the prototypes I've implemented so far it works quite well; I guess we'd
> just need to focus the next months on bringing support for more runners.
>
> With a great effort from a lot of contributors(*), Python SDK [1] is now a
> > mostly complete, tested, performant Python implementation of the Beam
> > model. Since June, when we first started with Python SDK in Apache Beam
> we
> > have been continuously improving it.
> >
>
> I wouldn't merge during the preparation of 0.5.0 release, but after that
> could be a good time to merge back into master.
>
>
> ** Python SDK currently supports:
> >
> > * Model: All main concepts are present (ParDo, GroupByKey, Windowing
> etc.).
> > * IO: There are extensible APIs for writing new bounded sources and
> sinks.
> > Implementations are provided for Text, Avro, BigQuery, and Datastore.
> > * Runners: Python SDK has an extensible base runner module that allows
> > building specific runners on top of it. The SDK comes with two pipeline
> > runners: DirectRunner and DataflowRunner; and it is possible to add more.
> > The existing runners are currently limited to bounded execution and
> > otherwise equivalent to their Java SDK counterparts in functionality.
> >
>
> What would be the effort of porting, and maintaining, parallel versions of
> the Java runners? I guess I'd need to dig deeper into the model, but this
> may represent a major effort for the project, right?
>

The effort is somewhat higher for DirectRunner because DirectRunner also
implements the code for execution. It is not that high for DataflowRunner
because the base runner module has a lot of helpers with the right hooks for
implementing a generic runner. I would _expect_ the experience of adding a
new runner to be, in general, similar to the latter.
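
To make the "helpers with the right hooks" idea a bit more concrete, here is
a minimal, self-contained sketch. It is NOT the actual Beam runner API: the
class names (BaseRunner, InMemoryRunner), the run_<step> hook naming, and the
(name, fn) step format are all assumptions made up for illustration. The
point is only that a base class can own the traversal while a concrete
runner fills in per-transform hooks.

```python
class BaseRunner(object):
    """Hypothetical base runner: walks the steps and dispatches to hooks."""

    def run(self, steps, data):
        for name, fn in steps:
            # Look up a hook such as run_pardo or run_group_by_key.
            hook = getattr(self, "run_" + name, None)
            if hook is None:
                raise NotImplementedError("no hook for step %r" % name)
            data = hook(fn, data)
        return data


class InMemoryRunner(BaseRunner):
    """Hypothetical concrete runner that executes every step locally."""

    def run_pardo(self, fn, data):
        # Apply fn to every element and flatten the per-element outputs.
        return [out for element in data for out in fn(element)]

    def run_group_by_key(self, fn, data):
        # fn is unused: grouping needs no user code in this sketch.
        grouped = {}
        for key, value in data:
            grouped.setdefault(key, []).append(value)
        return sorted(grouped.items())


# Usage: a tiny word-count-like pipeline expressed as (name, fn) steps.
steps = [
    ("pardo", lambda line: [(word, 1) for word in line.split()]),
    ("group_by_key", None),
]
result = InMemoryRunner().run(steps, ["a b", "a"])
```

A runner that only translates steps for a remote service would override the
same hooks but emit a job description instead of executing locally, which is
roughly why I expect it to be the cheaper case.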


>
>
>
> > * Testing: Python SDK implements ValidatesRunner test framework for
> > implementing integration test for current and future runners. There is
> unit
> > test coverage for all modules, and a number of integrations test for
> > validating existing runners.
> > * Documentation and examples: Documentation work has started on Python
> SDK.
> > Beam Programming Guide page has been updated to include Python [2]. The
> > code comes with many ready to use examples and we are in a good place to
> > start documenting those on the website.
> >
> > ** We are not done yet, next on the roadmap we have:
> >
> > * Streaming: Both of the existing runners lack support for streaming
> > execution, and currently there is work going on for adding streaming
> > support to DirectRunner [3].
> > * Documentation: Filling the rest of the Beam documentations with Python
> > SDK specific information and examples.
> > * SDK consistency: Making Python SDK consistent with the Java SDK. We
> have
> > come a long way on this and have only a few items left [4].
> > * Beamifying: We have been working on removing Dataflow-specific
> references
> > both from the documentation and from the code. There is some work left,
> and
> > we are currently working on those as well [5].
> >
> > ** Steps and implications of merging to master:
> >
> > * Master branch is merged to python-sdk branch at regular intervals and
> the
> > last merge was on 12/22. All the past merges were uneventful because
> there
> > is a minimal overlap in modified files between branches. Integrating
> > python-sdk to master will similarly touch a small number of existing
> files.
> >
> > * Python SDK is using the same tools for building and testing. It is
> > already integrated with Maven, Jenkins and Travis. Specifically the
> impact
> > to the testing infrastructure would be:
> > - There will be two additional test configurations in Travis. Since
> Travis
> > runs all configurations in parallel there should not be a noticeable
> change
> > in the Travis run time.
> > - Jenkins pre-commit test will start running the Python SDK tests. It
> will
> > add an additional 5 minutes to the completion time of pre-commit test.
> > Historically Python SDK tests were not flaky and did not cause any random
> > failures.
> > - Jenkins Python post-commit test is already separated from the other
> > post-commit tests and will continue to exist. It would not change the
> > testing time for any other test.
> >
> > * The release process needs to be updated to accommodate releasing Python
> > artifacts. Python SDK would fit in the existing release schedule and
> could
> > be released along with the Java SDK. The additional steps would include:
> > - Generating Python artifacts. This could be done with a single command
> > using Maven today.
> > - Publishing the artifacts to a central repository such as PyPI.
> >
>
> I'm more than happy to help with this. We purposely left some things open
> when we added Maven support to the Python build.
>

That would be awesome. We can coordinate on that post-merge.


>
>
>
> > - Updating the release guide to reflect the changes above.
> >
> > * Users: There are existing users using the Python SDK. To give a rough
> > estimate, a distribution of the Beam Python SDK had a total of 23K
> > downloads in the past 6 months [6]. Some of those users are already
> engaged
> > with the community (e.g. [7]). There might be an increased amount
> > engagement from the rest of them after the merge.
> >
>
> Python 3 support is something we definitely need to plan ahead for. I'd try
> to make the codebase compatible with both 2.7.x and 3.6.x, rather than
> using other solutions like 2to3.
>

I agree with you. I think it makes more sense to make the codebase compatible
with both. As you mentioned, Python 3 support is not a short-term goal on
the roadmap, and we can discuss it more as we approach that.


>
>
> Looking forward to hearing your thoughts and comments on “graduating”
> > python-sdk to the master.
> >
> > Thank you,
> > Ahmet
> >
> > (*) Python SDK branch currently has a diverse group of contributors.
> > Regular contributors include Charles Chen, Chamikara Jayalath, María
> García
> > Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam PMC),
> > Sourabh Bajaj, and Vikas Kedigehalli. We have also had contributions from
> > Abdullah Bashir, Marco Buccini, Sergio Fernández, Seunghyun Lee, and
> > Younghee Kwon.
> >
> > [1] https://github.com/apache/beam/tree/python-sdk/sdks/python
> > [2] https://beam.apache.org/documentation/programming-guide/
> > [3] https://issues.apache.org/jira/browse/BEAM-1265
> > [4] https://issues.apache.org/jira/issues/?jql=status%20%3D%20Open%20AND%20labels%20%3D%20sdk-consistency
> > [5] https://issues.apache.org/jira/browse/BEAM-1218
> > [6] https://pypi.python.org/pypi/google-cloud-dataflow/json
> > [7] https://issues.apache.org/jira/browse/BEAM-1251
> >
>
>
> Great summary, Ahmet. Thanks.
>
> Cheers,
>
> --
> Sergio Fernández
> Partner Technology Manager
> Redlink GmbH
> m: +43 6602747925
> e: sergio.fernandez@redlink.co
> w: http://redlink.co
>

Re: [DISCUSS] Python SDK status and next steps

Posted by Sergio Fernández <wi...@apache.org>.
Hi

On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay <al...@google.com.invalid>
wrote:
>
> tl;dr: I would like to start a discussion about merging python-sdk branch
> to master branch. Python SDK is mature enough and merging it to master will
> accelerate its development and adoption.
>

Good point, Ahmet!

I've been following the development closely since it was imported in June.
For the prototypes I've implemented so far it works quite well; I guess we'd
just need to focus the next months on bringing support for more runners.

> With a great effort from a lot of contributors(*), Python SDK [1] is now a
> mostly complete, tested, performant Python implementation of the Beam
> model. Since June, when we first started with Python SDK in Apache Beam we
> have been continuously improving it.
>

I wouldn't merge during the preparation of the 0.5.0 release, but after that
could be a good time to merge into master.


> ** Python SDK currently supports:
>
> * Model: All main concepts are present (ParDo, GroupByKey, Windowing etc.).
> * IO: There are extensible APIs for writing new bounded sources and sinks.
> Implementations are provided for Text, Avro, BigQuery, and Datastore.
> * Runners: Python SDK has an extensible base runner module that allows
> building specific runners on top of it. The SDK comes with two pipeline
> runners: DirectRunner and DataflowRunner; and it is possible to add more.
> The existing runners are currently limited to bounded execution and
> otherwise equivalent to their Java SDK counterparts in functionality.
>

What would be the effort of porting, and maintaining, parallel versions of
the Java runners? I guess I'd need to dig deeper into the model, but this may
represent a major effort for the project, right?



> * Testing: Python SDK implements the ValidatesRunner test framework for
> implementing integration tests for current and future runners. There is unit
> test coverage for all modules, and a number of integration tests for
> validating existing runners.
> * Documentation and examples: Documentation work has started on Python SDK.
> Beam Programming Guide page has been updated to include Python [2]. The
> code comes with many ready to use examples and we are in a good place to
> start documenting those on the website.
>
> ** We are not done yet, next on the roadmap we have:
>
> * Streaming: Both of the existing runners lack support for streaming
> execution, and currently there is work going on for adding streaming
> support to DirectRunner [3].
> * Documentation: Filling the rest of the Beam documentations with Python
> SDK specific information and examples.
> * SDK consistency: Making Python SDK consistent with the Java SDK. We have
> come a long way on this and have only a few items left [4].
> * Beamifying: We have been working on removing Dataflow-specific references
> both from the documentation and from the code. There is some work left, and
> we are currently working on those as well [5].
>
> ** Steps and implications of merging to master:
>
> * Master branch is merged to python-sdk branch at regular intervals and the
> last merge was on 12/22. All the past merges were uneventful because there
> is a minimal overlap in modified files between branches. Integrating
> python-sdk to master will similarly touch a small number of existing files.
>
> * Python SDK is using the same tools for building and testing. It is
> already integrated with Maven, Jenkins and Travis. Specifically the impact
> to the testing infrastructure would be:
> - There will be two additional test configurations in Travis. Since Travis
> runs all configurations in parallel there should not be a noticeable change
> in the Travis run time.
> - Jenkins pre-commit test will start running the Python SDK tests. It will
> add an additional 5 minutes to the completion time of pre-commit test.
> Historically Python SDK tests were not flaky and did not cause any random
> failures.
> - Jenkins Python post-commit test is already separated from the other
> post-commit tests and will continue to exist. It would not change the
> testing time for any other test.
>
> * The release process needs to be updated to accommodate releasing Python
> artifacts. Python SDK would fit in the existing release schedule and could
> be released along with the Java SDK. The additional steps would include:
> - Generating Python artifacts. This could be done with a single command
> using Maven today.
> - Publishing the artifacts to a central repository such as PyPI.
>

I'm more than happy to help with this. We purposely left some things open
when we added Maven support to the Python build.



> - Updating the release guide to reflect the changes above.
>
> * Users: There are existing users using the Python SDK. To give a rough
> estimate, a distribution of the Beam Python SDK had a total of 23K
> downloads in the past 6 months [6]. Some of those users are already engaged
> with the community (e.g. [7]). There might be an increased amount of
> engagement from the rest of them after the merge.
>

Python 3 support is something we definitely need to plan ahead for. I'd try
to make the codebase compatible with both 2.7.x and 3.6.x, rather than
using other solutions like 2to3.
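
As a rough illustration of what I mean by a single compatible codebase, here
is a small sketch that runs unchanged on both interpreters. The helper names
(PY2, text_type, to_text, iter_items) are made up for this example and are
not part of the SDK; a real port would probably lean on a compatibility
library instead of hand-written shims like these.

```python
import sys

# True when running under a Python 2 interpreter.
PY2 = sys.version_info[0] == 2

if PY2:
    text_type = unicode  # noqa: F821 (this name only exists on Python 2)
else:
    text_type = str


def to_text(value, encoding="utf-8"):
    """Return value as a text string on both Python 2 and Python 3."""
    if isinstance(value, bytes) and not isinstance(value, text_type):
        # Byte strings (str on Python 2, bytes on Python 3) get decoded.
        return value.decode(encoding)
    return text_type(value)


def iter_items(mapping):
    """Iterate over (key, value) pairs without copying on Python 2."""
    if PY2:
        return mapping.iteritems()
    return iter(mapping.items())
```

The same file is imported as-is on 2.7 and 3.x, so there is no generated
source tree to keep in sync, which is the main drawback I see with 2to3.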


> Looking forward to hearing your thoughts and comments on “graduating”
> python-sdk to the master.
>
> Thank you,
> Ahmet
>
> (*) Python SDK branch currently has a diverse group of contributors.
> Regular contributors include Charles Chen, Chamikara Jayalath, María García
> Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam PMC),
> Sourabh Bajaj, and Vikas Kedigehalli. We have also had contributions from
> Abdullah Bashir, Marco Buccini, Sergio Fernández, Seunghyun Lee, and
> Younghee Kwon.
>
> [1] https://github.com/apache/beam/tree/python-sdk/sdks/python
> [2] https://beam.apache.org/documentation/programming-guide/
> [3] https://issues.apache.org/jira/browse/BEAM-1265
> [4] https://issues.apache.org/jira/issues/?jql=status%20%3D%20Open%20AND%20labels%20%3D%20sdk-consistency
> [5] https://issues.apache.org/jira/browse/BEAM-1218
> [6] https://pypi.python.org/pypi/google-cloud-dataflow/json
> [7] https://issues.apache.org/jira/browse/BEAM-1251
>


Great summary, Ahmet. Thanks.

Cheers,

-- 
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 6602747925
e: sergio.fernandez@redlink.co
w: http://redlink.co

Re: [DISCUSS] Python SDK status and next steps

Posted by Frances Perry <fr...@apache.org>.
+1 to merging after 0.5.

It's on a great trajectory in terms of development and community.

On Tue, Jan 17, 2017 at 5:48 PM, Kenneth Knowles <kl...@google.com.invalid>
wrote:

> Seems reasonable, and the timeline Davor suggests makes a lot of sense.
>
> On Tue, Jan 17, 2017 at 3:59 PM, Lukasz Cwik <lc...@google.com.invalid>
> wrote:
>
> > I'm also for merging to master.
> >
> > On Tue, Jan 17, 2017 at 3:39 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
> > wrote:
> >
> > > It makes sense to merge after 0.5.0 release.
> > >
> > > Good point Davor: +1
> > >
> > > Regards
> > > JB
> > >
> > >
> > > On 01/17/2017 03:34 PM, Davor Bonaci wrote:
> > >
> > >> +1. I think merging to master would be an awesome next step for the
> > >> Python SDK.
> > >>
> > >> And, thanks for a great summary of the current state, roadmap, and
> > >> impact to the project as a whole -- awesome!
> > >>
> > >> Process-wise, I'd suggest starting a formal vote once this discussion
> > >> seems to be trending towards a conclusion, and complete the merge as
> > >> soon as the next release (0.5.0) is cut. This would enable additional
> > >> time before 0.6.0 to figure out compliance, release process impact, etc.
> > >>
> > >> Great work everyone!
> > >>
> > >> On Tue, Jan 17, 2017 at 8:26 AM, Jean-Baptiste Onofré <jb@nanthrax.net>
> > >> wrote:
> > >>
> > >> Hi
> > >>>
> > >>> I haven't tried the Python SDK recently, but you provided a clear
> > >>> "state of the art". Anyway, I'm in favor of merging things as quickly
> > >>> as possible (assuming it's in good shape in terms of build, test, ...):
> > >>> it would potentially grow the "external" contributions.
> > >>>
> > >>> So +1 from my side.
> > >>>
> > >>> Regards
> > >>> JB
> > >>>
> > >>> On Jan 17, 2017, 08:22, at 08:22, Ahmet Altay <altay@google.com.INVALID>
> > >>> wrote:
> > >>>
> > >>>> Hi all,
> > >>>>
> > >>>> tl;dr: I would like to start a discussion about merging python-sdk
> > >>>> branch
> > >>>> to master branch. Python SDK is mature enough and merging it to
> master
> > >>>> will
> > >>>> accelerate its development and adoption.
> > >>>>
> > >>>> With a great effort from a lot of contributors(*), Python SDK [1] is
> > >>>> now a
> > >>>> mostly complete, tested, performant Python implementation of the
> Beam
> > >>>> model. Since June, when we first started with Python SDK in Apache
> > Beam
> > >>>> we
> > >>>> have been continuously improving it.
> > >>>>
> > >>>> ** Python SDK currently supports:
> > >>>>
> > >>>> * Model: All main concepts are present (ParDo, GroupByKey, Windowing
> > >>>> etc.).
> > >>>> * IO: There are extensible APIs for writing new bounded sources and
> > >>>> sinks.
> > >>>> Implementations are provided for Text, Avro, BigQuery, and
> Datastore.
> > >>>> * Runners: Python SDK has an extensible base runner module that
> allows
> > >>>> building specific runners on top of it. The SDK comes with two
> > pipeline
> > >>>> runners: DirectRunner and DataflowRunner; and it is possible to add
> > >>>> more.
> > >>>> The existing runners are currently limited to bounded execution and
> > >>>> otherwise equivalent to their Java SDK counterparts in
> functionality.
> > >>>> * Testing: Python SDK implements ValidatesRunner test framework for
> > >>>> implementing integration test for current and future runners. There
> is
> > >>>> unit
> > >>>> test coverage for all modules, and a number of integrations test for
> > >>>> validating existing runners.
> > >>>> * Documentation and examples: Documentation work has started on
> Python
> > >>>> SDK.
> > >>>> Beam Programming Guide page has been updated to include Python [2].
> > The
> > >>>> code comes with many ready to use examples and we are in a good
> place
> > >>>> to
> > >>>> start documenting those on the website.
> > >>>>
> > >>>> ** We are not done yet, next on the roadmap we have:
> > >>>>
> > >>>> * Streaming: Both of the existing runners lack support for streaming
> > >>>> execution, and currently there is work going on for adding streaming
> > >>>> support to DirectRunner [3].
> > >>>> * Documentation: Filling the rest of the Beam documentations with
> > >>>> Python
> > >>>> SDK specific information and examples.
> > >>>> * SDK consistency: Making Python SDK consistent with the Java SDK.
> We
> > >>>> have
> > >>>> come a long way on this and have only a few items left [4].
> > >>>> * Beamifying: We have been working on removing Dataflow-specific
> > >>>> references
> > >>>> both from the documentation and from the code. There is some work
> > left,
> > >>>> and
> > >>>> we are currently working on those as well [5].
> > >>>>
> > >>>> ** Steps and implications of merging to master:
> > >>>>
> > >>>> * Master branch is merged to python-sdk branch at regular intervals
> > and
> > >>>> the
> > >>>> last merge was on 12/22. All the past merges were uneventful because
> > >>>> there
> > >>>> is a minimal overlap in modified files between branches. Integrating
> > >>>> python-sdk to master will similarly touch a small number of existing
> > >>>> files.
> > >>>>
> > >>>> * Python SDK is using the same tools for building and testing. It is
> > >>>> already integrated with Maven, Jenkins and Travis. Specifically the
> > >>>> impact
> > >>>> to the testing infrastructure would be:
> > >>>> - There will be two additional test configurations in Travis. Since
> > >>>> Travis
> > >>>> runs all configurations in parallel there should not be a noticeable
> > >>>> change
> > >>>> in the Travis run time.
> > >>>> - Jenkins pre-commit test will start running the Python SDK tests.
> It
> > >>>> will
> > >>>> add an additional 5 minutes to the completion time of pre-commit
> test.
> > >>>> Historically Python SDK tests were not flaky and did not cause any
> > >>>> random
> > >>>> failures.
> > >>>> - Jenkins Python post-commit test is already separated from the
> other
> > >>>> post-commit tests and will continue to exist. It would not change
> the
> > >>>> testing time for any other test.
> > >>>>
> > >>>> * The release process needs to be updated to accommodate releasing
> > >>>> Python
> > >>>> artifacts. Python SDK would fit in the existing release schedule and
> > >>>> could
> > >>>> be released along with the Java SDK. The additional steps would
> > >>>> include:
> > >>>> - Generating Python artifacts. This could be done with a single
> > command
> > >>>> using Maven today.
> > >>>> - Publishing the artifacts to a central repository such as PyPI.
> > >>>> - Updating the release guide to reflect the changes above.
> > >>>>
> > >>>> * Users: There are existing users using the Python SDK. To give a
> > rough
> > >>>> estimate, a distribution of the Beam Python SDK had a total of 23K
> > >>>> downloads in the past 6 months [6]. Some of those users are already
> > >>>> engaged
> > >>>> with the community (e.g. [7]). There might be an increased amount
> > >>>> engagement from the rest of them after the merge.
> > >>>>
> > >>>> Looking forward to hearing your thoughts and comments on
> “graduating”
> > >>>> python-sdk to the master.
> > >>>>
> > >>>> Thank you,
> > >>>> Ahmet
> > >>>>
> > >>>> (*) Python SDK branch currently has a diverse group of contributors.
> > >>>> Regular contributors include Charles Chen, Chamikara Jayalath, María
> > >>>> García
> > >>>> Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam PMC),
> > >>>> Sourabh Bajaj, and Vikas Kedigehalli. We have also had contributions
> > >>>> from
> > >>>> Abdullah Bashir, Marco Buccini, Sergio Fernández, Seunghyun Lee, and
> > >>>> Younghee Kwon.
> > >>>>
> > >>>> [1] https://github.com/apache/beam/tree/python-sdk/sdks/python
> > >>>> [2] https://beam.apache.org/documentation/programming-guide/
> > >>>> [3] https://issues.apache.org/jira/browse/BEAM-1265
> > >>>> [4]
> > >>>> https://issues.apache.org/jira/issues/?jql=status%20%3D%20Open%20AND%20labels%20%3D%20sdk-consistency
> > >>>> [5] https://issues.apache.org/jira/browse/BEAM-1218
> > >>>> [6] https://pypi.python.org/pypi/google-cloud-dataflow/json
> > >>>> [7] https://issues.apache.org/jira/browse/BEAM-1251
> > >>>>
> > >>>
> > >>>
> > >>
> > > --
> > > Jean-Baptiste Onofré
> > > jbonofre@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> > >
> >
>

Re: [DISCUSS] Python SDK status and next steps

Posted by Kenneth Knowles <kl...@google.com.INVALID>.
Seems reasonable, and the timeline Davor suggests makes a lot of sense.

On Tue, Jan 17, 2017 at 3:59 PM, Lukasz Cwik <lc...@google.com.invalid>
wrote:

> I'm also for merging to master.
>
> On Tue, Jan 17, 2017 at 3:39 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
> wrote:
>
> > It makes sense to merge after 0.5.0 release.
> >
> > Good point Davor: +1
> >
> > Regards
> > JB
> >
> >
> > On 01/17/2017 03:34 PM, Davor Bonaci wrote:
> >
> >> +1. I think merging to master would be an awesome next step for the
> >> Python SDK.
> >>
> >> And, thanks for a great summary of the current state, roadmap, and
> >> impact to the project as a whole -- awesome!
> >>
> >> Process-wise, I'd suggest starting a formal vote once this discussion
> >> seems to be trending towards a conclusion, and complete the merge as
> >> soon as the next release (0.5.0) is cut. This would enable additional
> >> time before 0.6.0 to figure out compliance, release process impact, etc.
> >>
> >> Great work everyone!
> >>
> >> On Tue, Jan 17, 2017 at 8:26 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
> >> wrote:
> >>
> >> Hi
> >>>
> >>> I didn't try the Python SDK recently but you provided a clear "state of
> >>> the art". Anyway I'm in favor of merging things as quick as possible
> >>> (assuming it's in a good shape in term of build, test, ...): it would
> >>> potentially grow up the "external" contributions.
> >>>
> >>> So +1 from my side.
> >>>
> >>> Regards
> >>> JB
> >>>
> >>> On Jan 17, 2017, 08:22, at 08:22, Ahmet Altay <altay@google.com.INVALID>
> >>> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> tl;dr: I would like to start a discussion about merging python-sdk
> >>>> branch
> >>>> to master branch. Python SDK is mature enough and merging it to master
> >>>> will
> >>>> accelerate its development and adoption.
> >>>>
> >>>> With a great effort from a lot of contributors(*), Python SDK [1] is
> >>>> now a
> >>>> mostly complete, tested, performant Python implementation of the Beam
> >>>> model. Since June, when we first started with Python SDK in Apache
> Beam
> >>>> we
> >>>> have been continuously improving it.
> >>>>
> >>>> ** Python SDK currently supports:
> >>>>
> >>>> * Model: All main concepts are present (ParDo, GroupByKey, Windowing
> >>>> etc.).
> >>>> * IO: There are extensible APIs for writing new bounded sources and
> >>>> sinks.
> >>>> Implementations are provided for Text, Avro, BigQuery, and Datastore.
> >>>> * Runners: Python SDK has an extensible base runner module that allows
> >>>> building specific runners on top of it. The SDK comes with two
> pipeline
> >>>> runners: DirectRunner and DataflowRunner; and it is possible to add
> >>>> more.
> >>>> The existing runners are currently limited to bounded execution and
> >>>> otherwise equivalent to their Java SDK counterparts in functionality.
> >>>> * Testing: Python SDK implements ValidatesRunner test framework for
> >>>> implementing integration test for current and future runners. There is
> >>>> unit
> >>>> test coverage for all modules, and a number of integrations test for
> >>>> validating existing runners.
> >>>> * Documentation and examples: Documentation work has started on Python
> >>>> SDK.
> >>>> Beam Programming Guide page has been updated to include Python [2].
> The
> >>>> code comes with many ready to use examples and we are in a good place
> >>>> to
> >>>> start documenting those on the website.
> >>>>
> >>>> ** We are not done yet, next on the roadmap we have:
> >>>>
> >>>> * Streaming: Both of the existing runners lack support for streaming
> >>>> execution, and currently there is work going on for adding streaming
> >>>> support to DirectRunner [3].
> >>>> * Documentation: Filling the rest of the Beam documentations with
> >>>> Python
> >>>> SDK specific information and examples.
> >>>> * SDK consistency: Making Python SDK consistent with the Java SDK. We
> >>>> have
> >>>> come a long way on this and have only a few items left [4].
> >>>> * Beamifying: We have been working on removing Dataflow-specific
> >>>> references
> >>>> both from the documentation and from the code. There is some work
> left,
> >>>> and
> >>>> we are currently working on those as well [5].
> >>>>
> >>>> ** Steps and implications of merging to master:
> >>>>
> >>>> * Master branch is merged to python-sdk branch at regular intervals
> and
> >>>> the
> >>>> last merge was on 12/22. All the past merges were uneventful because
> >>>> there
> >>>> is a minimal overlap in modified files between branches. Integrating
> >>>> python-sdk to master will similarly touch a small number of existing
> >>>> files.
> >>>>
> >>>> * Python SDK is using the same tools for building and testing. It is
> >>>> already integrated with Maven, Jenkins and Travis. Specifically the
> >>>> impact
> >>>> to the testing infrastructure would be:
> >>>> - There will be two additional test configurations in Travis. Since
> >>>> Travis
> >>>> runs all configurations in parallel there should not be a noticeable
> >>>> change
> >>>> in the Travis run time.
> >>>> - Jenkins pre-commit test will start running the Python SDK tests. It
> >>>> will
> >>>> add an additional 5 minutes to the completion time of pre-commit test.
> >>>> Historically Python SDK tests were not flaky and did not cause any
> >>>> random
> >>>> failures.
> >>>> - Jenkins Python post-commit test is already separated from the other
> >>>> post-commit tests and will continue to exist. It would not change the
> >>>> testing time for any other test.
> >>>>
> >>>> * The release process needs to be updated to accommodate releasing
> >>>> Python
> >>>> artifacts. Python SDK would fit in the existing release schedule and
> >>>> could
> >>>> be released along with the Java SDK. The additional steps would
> >>>> include:
> >>>> - Generating Python artifacts. This could be done with a single
> command
> >>>> using Maven today.
> >>>> - Publishing the artifacts to a central repository such as PyPI.
> >>>> - Updating the release guide to reflect the changes above.
> >>>>
> >>>> * Users: There are existing users using the Python SDK. To give a
> rough
> >>>> estimate, a distribution of the Beam Python SDK had a total of 23K
> >>>> downloads in the past 6 months [6]. Some of those users are already
> >>>> engaged
> >>>> with the community (e.g. [7]). There might be an increased amount
> >>>> engagement from the rest of them after the merge.
> >>>>
> >>>> Looking forward to hearing your thoughts and comments on “graduating”
> >>>> python-sdk to the master.
> >>>>
> >>>> Thank you,
> >>>> Ahmet
> >>>>
> >>>> (*) Python SDK branch currently has a diverse group of contributors.
> >>>> Regular contributors include Charles Chen, Chamikara Jayalath, María
> >>>> García
> >>>> Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam PMC),
> >>>> Sourabh Bajaj, and Vikas Kedigehalli. We have also had contributions
> >>>> from
> >>>> Abdullah Bashir, Marco Buccini, Sergio Fernández, Seunghyun Lee, and
> >>>> Younghee Kwon.
> >>>>
> >>>> [1] https://github.com/apache/beam/tree/python-sdk/sdks/python
> >>>> [2] https://beam.apache.org/documentation/programming-guide/
> >>>> [3] https://issues.apache.org/jira/browse/BEAM-1265
> >>>> [4]
> >>>> https://issues.apache.org/jira/issues/?jql=status%20%3D%20Open%20AND%20labels%20%3D%20sdk-consistency
> >>>> [5] https://issues.apache.org/jira/browse/BEAM-1218
> >>>> [6] https://pypi.python.org/pypi/google-cloud-dataflow/json
> >>>> [7] https://issues.apache.org/jira/browse/BEAM-1251
> >>>>
> >>>
> >>>
> >>
> > --
> > Jean-Baptiste Onofré
> > jbonofre@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>

Re: [DISCUSS] Python SDK status and next steps

Posted by Lukasz Cwik <lc...@google.com.INVALID>.
I'm also for merging to master.

On Tue, Jan 17, 2017 at 3:39 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> It makes sense to merge after 0.5.0 release.
>
> Good point Davor: +1
>
> Regards
> JB
>
>
> On 01/17/2017 03:34 PM, Davor Bonaci wrote:
>
>> +1. I think merging to master would be an awesome next step for the Python
>> SDK.
>>
>> And, thanks for a great summary of the current state, roadmap, and impact
>> to the project as a whole -- awesome!
>>
>> Process-wise, I'd suggest starting a formal vote once this discussion
>> seems
>> to be trending towards a conclusion, and complete the merge as soon as the
>> next release (0.5.0) is cut. This would enable additional time before
>> 0.6.0
>> to figure out compliance, release process impact, etc.
>>
>> Great work everyone!
>>
>> On Tue, Jan 17, 2017 at 8:26 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
>> wrote:
>>
>> Hi
>>>
>>> I didn't try the Python SDK recently but you provided a clear "state of
>>> the art". Anyway I'm in favor of merging things as quick as possible
>>> (assuming it's in a good shape in term of build, test, ...): it would
>>> potentially grow up the "external" contributions.
>>>
>>> So +1 from my side.
>>>
>>> Regards
>>> JB
>>>
>>> On Jan 17, 2017, 08:22, at 08:22, Ahmet Altay <al...@google.com.INVALID>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> tl;dr: I would like to start a discussion about merging python-sdk
>>>> branch
>>>> to master branch. Python SDK is mature enough and merging it to master
>>>> will
>>>> accelerate its development and adoption.
>>>>
>>>> With a great effort from a lot of contributors(*), Python SDK [1] is
>>>> now a
>>>> mostly complete, tested, performant Python implementation of the Beam
>>>> model. Since June, when we first started with Python SDK in Apache Beam
>>>> we
>>>> have been continuously improving it.
>>>>
>>>> ** Python SDK currently supports:
>>>>
>>>> * Model: All main concepts are present (ParDo, GroupByKey, Windowing
>>>> etc.).
>>>> * IO: There are extensible APIs for writing new bounded sources and
>>>> sinks.
>>>> Implementations are provided for Text, Avro, BigQuery, and Datastore.
>>>> * Runners: Python SDK has an extensible base runner module that allows
>>>> building specific runners on top of it. The SDK comes with two pipeline
>>>> runners: DirectRunner and DataflowRunner; and it is possible to add
>>>> more.
>>>> The existing runners are currently limited to bounded execution and
>>>> otherwise equivalent to their Java SDK counterparts in functionality.
>>>> * Testing: Python SDK implements ValidatesRunner test framework for
>>>> implementing integration test for current and future runners. There is
>>>> unit
>>>> test coverage for all modules, and a number of integrations test for
>>>> validating existing runners.
>>>> * Documentation and examples: Documentation work has started on Python
>>>> SDK.
>>>> Beam Programming Guide page has been updated to include Python [2]. The
>>>> code comes with many ready to use examples and we are in a good place
>>>> to
>>>> start documenting those on the website.
>>>>
>>>> ** We are not done yet, next on the roadmap we have:
>>>>
>>>> * Streaming: Both of the existing runners lack support for streaming
>>>> execution, and currently there is work going on for adding streaming
>>>> support to DirectRunner [3].
>>>> * Documentation: Filling the rest of the Beam documentations with
>>>> Python
>>>> SDK specific information and examples.
>>>> * SDK consistency: Making Python SDK consistent with the Java SDK. We
>>>> have
>>>> come a long way on this and have only a few items left [4].
>>>> * Beamifying: We have been working on removing Dataflow-specific
>>>> references
>>>> both from the documentation and from the code. There is some work left,
>>>> and
>>>> we are currently working on those as well [5].
>>>>
>>>> ** Steps and implications of merging to master:
>>>>
>>>> * Master branch is merged to python-sdk branch at regular intervals and
>>>> the
>>>> last merge was on 12/22. All the past merges were uneventful because
>>>> there
>>>> is a minimal overlap in modified files between branches. Integrating
>>>> python-sdk to master will similarly touch a small number of existing
>>>> files.
>>>>
>>>> * Python SDK is using the same tools for building and testing. It is
>>>> already integrated with Maven, Jenkins and Travis. Specifically the
>>>> impact
>>>> to the testing infrastructure would be:
>>>> - There will be two additional test configurations in Travis. Since
>>>> Travis
>>>> runs all configurations in parallel there should not be a noticeable
>>>> change
>>>> in the Travis run time.
>>>> - Jenkins pre-commit test will start running the Python SDK tests. It
>>>> will
>>>> add an additional 5 minutes to the completion time of pre-commit test.
>>>> Historically Python SDK tests were not flaky and did not cause any
>>>> random
>>>> failures.
>>>> - Jenkins Python post-commit test is already separated from the other
>>>> post-commit tests and will continue to exist. It would not change the
>>>> testing time for any other test.
>>>>
>>>> * The release process needs to be updated to accommodate releasing
>>>> Python
>>>> artifacts. Python SDK would fit in the existing release schedule and
>>>> could
>>>> be released along with the Java SDK. The additional steps would
>>>> include:
>>>> - Generating Python artifacts. This could be done with a single command
>>>> using Maven today.
>>>> - Publishing the artifacts to a central repository such as PyPI.
>>>> - Updating the release guide to reflect the changes above.
>>>>
>>>> * Users: There are existing users using the Python SDK. To give a rough
>>>> estimate, a distribution of the Beam Python SDK had a total of 23K
>>>> downloads in the past 6 months [6]. Some of those users are already
>>>> engaged
>>>> with the community (e.g. [7]). There might be an increased amount
>>>> engagement from the rest of them after the merge.
>>>>
>>>> Looking forward to hearing your thoughts and comments on “graduating”
>>>> python-sdk to the master.
>>>>
>>>> Thank you,
>>>> Ahmet
>>>>
>>>> (*) Python SDK branch currently has a diverse group of contributors.
>>>> Regular contributors include Charles Chen, Chamikara Jayalath, María
>>>> García
>>>> Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam PMC),
>>>> Sourabh Bajaj, and Vikas Kedigehalli. We have also had contributions
>>>> from
>>>> Abdullah Bashir, Marco Buccini, Sergio Fernández, Seunghyun Lee, and
>>>> Younghee Kwon.
>>>>
>>>> [1] https://github.com/apache/beam/tree/python-sdk/sdks/python
>>>> [2] https://beam.apache.org/documentation/programming-guide/
>>>> [3] https://issues.apache.org/jira/browse/BEAM-1265
>>>> [4]
>>>> https://issues.apache.org/jira/issues/?jql=status%20%3D%20Open%20AND%20labels%20%3D%20sdk-consistency
>>>> [5] https://issues.apache.org/jira/browse/BEAM-1218
>>>> [6] https://pypi.python.org/pypi/google-cloud-dataflow/json
>>>> [7] https://issues.apache.org/jira/browse/BEAM-1251
>>>>
>>>
>>>
>>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>

Re: [DISCUSS] Python SDK status and next steps

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
It makes sense to merge after the 0.5.0 release.

Good point Davor: +1

Regards
JB

On 01/17/2017 03:34 PM, Davor Bonaci wrote:
> +1. I think merging to master would be an awesome next step for the Python
> SDK.
>
> And, thanks for a great summary of the current state, roadmap, and impact
> to the project as a whole -- awesome!
>
> Process-wise, I'd suggest starting a formal vote once this discussion seems
> to be trending towards a conclusion, and complete the merge as soon as the
> next release (0.5.0) is cut. This would enable additional time before 0.6.0
> to figure out compliance, release process impact, etc.
>
> Great work everyone!
>
> On Tue, Jan 17, 2017 at 8:26 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
> wrote:
>
>> Hi
>>
>> I didn't try the Python SDK recently but you provided a clear "state of
>> the art". Anyway I'm in favor of merging things as quick as possible
>> (assuming it's in a good shape in term of build, test, ...): it would
>> potentially grow up the "external" contributions.
>>
>> So +1 from my side.
>>
>> Regards
>> JB
>>
>> On Jan 17, 2017, 08:22, at 08:22, Ahmet Altay <al...@google.com.INVALID>
>> wrote:
>>> Hi all,
>>>
>>> tl;dr: I would like to start a discussion about merging python-sdk
>>> branch
>>> to master branch. Python SDK is mature enough and merging it to master
>>> will
>>> accelerate its development and adoption.
>>>
>>> With a great effort from a lot of contributors(*), Python SDK [1] is
>>> now a
>>> mostly complete, tested, performant Python implementation of the Beam
>>> model. Since June, when we first started with Python SDK in Apache Beam
>>> we
>>> have been continuously improving it.
>>>
>>> ** Python SDK currently supports:
>>>
>>> * Model: All main concepts are present (ParDo, GroupByKey, Windowing
>>> etc.).
>>> * IO: There are extensible APIs for writing new bounded sources and
>>> sinks.
>>> Implementations are provided for Text, Avro, BigQuery, and Datastore.
>>> * Runners: Python SDK has an extensible base runner module that allows
>>> building specific runners on top of it. The SDK comes with two pipeline
>>> runners: DirectRunner and DataflowRunner; and it is possible to add
>>> more.
>>> The existing runners are currently limited to bounded execution and
>>> otherwise equivalent to their Java SDK counterparts in functionality.
>>> * Testing: Python SDK implements ValidatesRunner test framework for
>>> implementing integration test for current and future runners. There is
>>> unit
>>> test coverage for all modules, and a number of integrations test for
>>> validating existing runners.
>>> * Documentation and examples: Documentation work has started on Python
>>> SDK.
>>> Beam Programming Guide page has been updated to include Python [2]. The
>>> code comes with many ready to use examples and we are in a good place
>>> to
>>> start documenting those on the website.
>>>
>>> ** We are not done yet, next on the roadmap we have:
>>>
>>> * Streaming: Both of the existing runners lack support for streaming
>>> execution, and currently there is work going on for adding streaming
>>> support to DirectRunner [3].
>>> * Documentation: Filling the rest of the Beam documentations with
>>> Python
>>> SDK specific information and examples.
>>> * SDK consistency: Making Python SDK consistent with the Java SDK. We
>>> have
>>> come a long way on this and have only a few items left [4].
>>> * Beamifying: We have been working on removing Dataflow-specific
>>> references
>>> both from the documentation and from the code. There is some work left,
>>> and
>>> we are currently working on those as well [5].
>>>
>>> ** Steps and implications of merging to master:
>>>
>>> * Master branch is merged to python-sdk branch at regular intervals and
>>> the
>>> last merge was on 12/22. All the past merges were uneventful because
>>> there
>>> is a minimal overlap in modified files between branches. Integrating
>>> python-sdk to master will similarly touch a small number of existing
>>> files.
>>>
>>> * Python SDK is using the same tools for building and testing. It is
>>> already integrated with Maven, Jenkins and Travis. Specifically the
>>> impact
>>> to the testing infrastructure would be:
>>> - There will be two additional test configurations in Travis. Since
>>> Travis
>>> runs all configurations in parallel there should not be a noticeable
>>> change
>>> in the Travis run time.
>>> - Jenkins pre-commit test will start running the Python SDK tests. It
>>> will
>>> add an additional 5 minutes to the completion time of pre-commit test.
>>> Historically Python SDK tests were not flaky and did not cause any
>>> random
>>> failures.
>>> - Jenkins Python post-commit test is already separated from the other
>>> post-commit tests and will continue to exist. It would not change the
>>> testing time for any other test.
>>>
>>> * The release process needs to be updated to accommodate releasing
>>> Python
>>> artifacts. Python SDK would fit in the existing release schedule and
>>> could
>>> be released along with the Java SDK. The additional steps would
>>> include:
>>> - Generating Python artifacts. This could be done with a single command
>>> using Maven today.
>>> - Publishing the artifacts to a central repository such as PyPI.
>>> - Updating the release guide to reflect the changes above.
>>>
>>> * Users: There are existing users using the Python SDK. To give a rough
>>> estimate, a distribution of the Beam Python SDK had a total of 23K
>>> downloads in the past 6 months [6]. Some of those users are already
>>> engaged
>>> with the community (e.g. [7]). There might be an increased amount
>>> engagement from the rest of them after the merge.
>>>
>>> Looking forward to hearing your thoughts and comments on “graduating”
>>> python-sdk to the master.
>>>
>>> Thank you,
>>> Ahmet
>>>
>>> (*) Python SDK branch currently has a diverse group of contributors.
>>> Regular contributors include Charles Chen, Chamikara Jayalath, María
>>> García
>>> Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam PMC),
>>> Sourabh Bajaj, and Vikas Kedigehalli. We have also had contributions
>>> from
>>> Abdullah Bashir, Marco Buccini, Sergio Fernández, Seunghyun Lee, and
>>> Younghee Kwon.
>>>
>>> [1] https://github.com/apache/beam/tree/python-sdk/sdks/python
>>> [2] https://beam.apache.org/documentation/programming-guide/
>>> [3] https://issues.apache.org/jira/browse/BEAM-1265
>>> [4]
>>> https://issues.apache.org/jira/issues/?jql=status%20%3D%20Open%20AND%20labels%20%3D%20sdk-consistency
>>> [5] https://issues.apache.org/jira/browse/BEAM-1218
>>> [6] https://pypi.python.org/pypi/google-cloud-dataflow/json
>>> [7] https://issues.apache.org/jira/browse/BEAM-1251
>>
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: [DISCUSS] Python SDK status and next steps

Posted by Davor Bonaci <da...@apache.org>.
+1. I think merging to master would be an awesome next step for the Python
SDK.

And, thanks for a great summary of the current state, roadmap, and impact
to the project as a whole -- awesome!

Process-wise, I'd suggest starting a formal vote once this discussion seems
to be trending towards a conclusion, and complete the merge as soon as the
next release (0.5.0) is cut. This would enable additional time before 0.6.0
to figure out compliance, release process impact, etc.

Great work everyone!

On Tue, Jan 17, 2017 at 8:26 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Hi
>
> I didn't try the Python SDK recently but you provided a clear "state of
> the art". Anyway I'm in favor of merging things as quick as possible
> (assuming it's in a good shape in term of build, test, ...): it would
> potentially grow up the "external" contributions.
>
> So +1 from my side.
>
> Regards
> JB
>
> On Jan 17, 2017, 08:22, at 08:22, Ahmet Altay <al...@google.com.INVALID>
> wrote:
> >Hi all,
> >
> >tl;dr: I would like to start a discussion about merging python-sdk
> >branch
> >to master branch. Python SDK is mature enough and merging it to master
> >will
> >accelerate its development and adoption.
> >
> >With a great effort from a lot of contributors(*), Python SDK [1] is
> >now a
> >mostly complete, tested, performant Python implementation of the Beam
> >model. Since June, when we first started with Python SDK in Apache Beam
> >we
> >have been continuously improving it.
> >
> >** Python SDK currently supports:
> >
> >* Model: All main concepts are present (ParDo, GroupByKey, Windowing
> >etc.).
> >* IO: There are extensible APIs for writing new bounded sources and
> >sinks.
> >Implementations are provided for Text, Avro, BigQuery, and Datastore.
> >* Runners: Python SDK has an extensible base runner module that allows
> >building specific runners on top of it. The SDK comes with two pipeline
> >runners: DirectRunner and DataflowRunner; and it is possible to add
> >more.
> >The existing runners are currently limited to bounded execution and
> >otherwise equivalent to their Java SDK counterparts in functionality.
> >* Testing: Python SDK implements ValidatesRunner test framework for
> >implementing integration test for current and future runners. There is
> >unit
> >test coverage for all modules, and a number of integrations test for
> >validating existing runners.
> >* Documentation and examples: Documentation work has started on Python
> >SDK.
> >Beam Programming Guide page has been updated to include Python [2]. The
> >code comes with many ready to use examples and we are in a good place
> >to
> >start documenting those on the website.
> >
> >** We are not done yet, next on the roadmap we have:
> >
> >* Streaming: Both of the existing runners lack support for streaming
> >execution, and currently there is work going on for adding streaming
> >support to DirectRunner [3].
> >* Documentation: Filling the rest of the Beam documentations with
> >Python
> >SDK specific information and examples.
> >* SDK consistency: Making Python SDK consistent with the Java SDK. We
> >have
> >come a long way on this and have only a few items left [4].
> >* Beamifying: We have been working on removing Dataflow-specific
> >references
> >both from the documentation and from the code. There is some work left,
> >and
> >we are currently working on those as well [5].
> >
> >** Steps and implications of merging to master:
> >
> >* Master branch is merged to python-sdk branch at regular intervals and
> >the
> >last merge was on 12/22. All the past merges were uneventful because
> >there
> >is a minimal overlap in modified files between branches. Integrating
> >python-sdk to master will similarly touch a small number of existing
> >files.
> >
> >* Python SDK is using the same tools for building and testing. It is
> >already integrated with Maven, Jenkins and Travis. Specifically the
> >impact
> >to the testing infrastructure would be:
> >- There will be two additional test configurations in Travis. Since
> >Travis
> >runs all configurations in parallel there should not be a noticeable
> >change
> >in the Travis run time.
> >- Jenkins pre-commit test will start running the Python SDK tests. It
> >will
> >add an additional 5 minutes to the completion time of pre-commit test.
> >Historically Python SDK tests were not flaky and did not cause any
> >random
> >failures.
> >- Jenkins Python post-commit test is already separated from the other
> >post-commit tests and will continue to exist. It would not change the
> >testing time for any other test.
> >
> >* The release process needs to be updated to accommodate releasing
> >Python
> >artifacts. Python SDK would fit in the existing release schedule and
> >could
> >be released along with the Java SDK. The additional steps would
> >include:
> >- Generating Python artifacts. This could be done with a single command
> >using Maven today.
> >- Publishing the artifacts to a central repository such as PyPI.
> >- Updating the release guide to reflect the changes above.
> >
> >* Users: There are existing users using the Python SDK. To give a rough
> >estimate, a distribution of the Beam Python SDK had a total of 23K
> >downloads in the past 6 months [6]. Some of those users are already
> >engaged
> >with the community (e.g. [7]). There might be an increased amount
> >of engagement from the rest of them after the merge.
> >
> >Looking forward to hearing your thoughts and comments on “graduating”
> >python-sdk to the master.
> >
> >Thank you,
> >Ahmet
> >
> >(*) Python SDK branch currently has a diverse group of contributors.
> >Regular contributors include Charles Chen, Chamikara Jayalath, María
> >García
> >Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam PMC),
> >Sourabh Bajaj, and Vikas Kedigehalli. We have also had contributions
> >from
> >Abdullah Bashir, Marco Buccini, Sergio Fernández, Seunghyun Lee, and
> >Younghee Kwon.
> >
> >[1] https://github.com/apache/beam/tree/python-sdk/sdks/python
> >[2] https://beam.apache.org/documentation/programming-guide/
> >[3] https://issues.apache.org/jira/browse/BEAM-1265
> >[4]
> >https://issues.apache.org/jira/issues/?jql=status%20%3D%20Open%20AND%20labels%20%3D%20sdk-consistency
> >[5] https://issues.apache.org/jira/browse/BEAM-1218
> >[6] https://pypi.python.org/pypi/google-cloud-dataflow/json
> >[7] https://issues.apache.org/jira/browse/BEAM-1251
>

Re: [DISCUSS] Python SDK status and next steps

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi

I haven't tried the Python SDK recently, but you provided a clear picture of its current state. In any case, I'm in favor of merging as quickly as possible (assuming it's in good shape in terms of build, tests, ...): it could help grow "external" contributions.

So +1 from my side.

Regards
JB

On Jan 17, 2017, at 08:22, Ahmet Altay <al...@google.com.INVALID> wrote:
>Hi all,
>
>tl;dr: I would like to start a discussion about merging python-sdk
>branch
>to master branch. Python SDK is mature enough and merging it to master
>will
>accelerate its development and adoption.
>
>With a great effort from a lot of contributors(*), Python SDK [1] is
>now a
>mostly complete, tested, performant Python implementation of the Beam
>model. Since June, when we first started with Python SDK in Apache Beam
>we
>have been continuously improving it.
>
>** Python SDK currently supports:
>
>* Model: All main concepts are present (ParDo, GroupByKey, Windowing
>etc.).
>* IO: There are extensible APIs for writing new bounded sources and
>sinks.
>Implementations are provided for Text, Avro, BigQuery, and Datastore.
>* Runners: Python SDK has an extensible base runner module that allows
>building specific runners on top of it. The SDK comes with two pipeline
>runners: DirectRunner and DataflowRunner; and it is possible to add
>more.
>The existing runners are currently limited to bounded execution and
>otherwise equivalent to their Java SDK counterparts in functionality.
>* Testing: Python SDK implements the ValidatesRunner test framework for
>implementing integration tests for current and future runners. There is
>unit
>test coverage for all modules, and a number of integration tests for
>validating existing runners.
>* Documentation and examples: Documentation work has started on Python
>SDK.
>Beam Programming Guide page has been updated to include Python [2]. The
>code comes with many ready to use examples and we are in a good place
>to
>start documenting those on the website.
>
>** We are not done yet, next on the roadmap we have:
>
>* Streaming: Both of the existing runners lack support for streaming
>execution, and currently there is work going on for adding streaming
>support to DirectRunner [3].
>* Documentation: Filling the rest of the Beam documentations with
>Python
>SDK specific information and examples.
>* SDK consistency: Making Python SDK consistent with the Java SDK. We
>have
>come a long way on this and have only a few items left [4].
>* Beamifying: We have been working on removing Dataflow-specific
>references
>both from the documentation and from the code. There is some work left,
>and
>we are currently working on those as well [5].
>
>** Steps and implications of merging to master:
>
>* Master branch is merged to python-sdk branch at regular intervals and
>the
>last merge was on 12/22. All the past merges were uneventful because
>there
>is a minimal overlap in modified files between branches. Integrating
>python-sdk to master will similarly touch a small number of existing
>files.
>
>* Python SDK is using the same tools for building and testing. It is
>already integrated with Maven, Jenkins and Travis. Specifically the
>impact
>to the testing infrastructure would be:
>- There will be two additional test configurations in Travis. Since
>Travis
>runs all configurations in parallel there should not be a noticeable
>change
>in the Travis run time.
>- Jenkins pre-commit test will start running the Python SDK tests. It
>will
>add an additional 5 minutes to the completion time of pre-commit test.
>Historically Python SDK tests were not flaky and did not cause any
>random
>failures.
>- Jenkins Python post-commit test is already separated from the other
>post-commit tests and will continue to exist. It would not change the
>testing time for any other test.
>
>* The release process needs to be updated to accommodate releasing
>Python
>artifacts. Python SDK would fit in the existing release schedule and
>could
>be released along with the Java SDK. The additional steps would
>include:
>- Generating Python artifacts. This could be done with a single command
>using Maven today.
>- Publishing the artifacts to a central repository such as PyPI.
>- Updating the release guide to reflect the changes above.
>
>* Users: There are existing users using the Python SDK. To give a rough
>estimate, a distribution of the Beam Python SDK had a total of 23K
>downloads in the past 6 months [6]. Some of those users are already
>engaged
>with the community (e.g. [7]). There might be an increased amount
>of engagement from the rest of them after the merge.
>
>Looking forward to hearing your thoughts and comments on “graduating”
>python-sdk to the master.
>
>Thank you,
>Ahmet
>
>(*) Python SDK branch currently has a diverse group of contributors.
>Regular contributors include Charles Chen, Chamikara Jayalath, María
>García
>Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam PMC),
>Sourabh Bajaj, and Vikas Kedigehalli. We have also had contributions
>from
>Abdullah Bashir, Marco Buccini, Sergio Fernández, Seunghyun Lee, and
>Younghee Kwon.
>
>[1] https://github.com/apache/beam/tree/python-sdk/sdks/python
>[2] https://beam.apache.org/documentation/programming-guide/
>[3] https://issues.apache.org/jira/browse/BEAM-1265
>[4]
>https://issues.apache.org/jira/issues/?jql=status%20%3D%20Open%20AND%20labels%20%3D%20sdk-consistency
>[5] https://issues.apache.org/jira/browse/BEAM-1218
>[6] https://pypi.python.org/pypi/google-cloud-dataflow/json
>[7] https://issues.apache.org/jira/browse/BEAM-1251