Posted to dev@flink.apache.org by Xianda Ke <ke...@gmail.com> on 2018/12/13 08:57:48 UTC

[DISCUSS] Python (and Non-JVM) Language Support in Flink

Currently there is an ongoing survey about the Python usage of Flink [1].
Some discussion was also brought up there regarding the non-JVM language
support strategy in general. To avoid polluting the survey thread, we are
starting this discussion thread and would like to move those discussions
here.

In the interest of facilitating the discussion, we would like to first
share the following design doc, which describes what we have done at
Alibaba on a Python API for Flink. It could serve as a good reference for
the discussion.

 [DISCUSS] Flink Python API
<https://docs.google.com/document/d/1JNGWdLwbo_btq9RVrc1PjWJV3lYUgPvK0uEWDIfVNJI/edit?usp=drive_web>

As of now, we've implemented and delivered Python UDFs for SQL to the
internal users at Alibaba, and we are now starting to implement a Python
API.

To recap and continue the discussion from the survey thread, I agree with
@Stephan that we should figure out in which general direction Python
support should go. Stephan also listed three options there:
* Option (1): Language portability via Apache Beam
* Option (2): Implement own Python API
* Option (3): Implement own portability layer

From my perspective:
(1) Flink's language APIs and Beam's language support are not mutually
exclusive.
It is nice that Beam has Python/NodeJS/Go APIs and supports Flink as a
runner.
Flink's own Python (or NodeJS/Go) APIs would still benefit Flink's
ecosystem.

(2) Python API / portability layer
To support non-JVM languages in Flink:
 * at the client side, Flink would provide language interfaces, which
translate the user's application into a Flink StreamGraph;
 * at the server side, Flink would execute the user's UDF code at runtime.
The non-JVM language communicates with the JVM via RPC (or a low-level
socket, an embedded interpreter, and so on). What the portability layer
can do is perhaps abstract away this RPC layer. Even when the portability
layer is ready, there is still a lot of work to do for a specific
language. For Python, say, we may still have to write the interface
classes by hand, because generated code without detailed documentation is
unacceptable to users, and we have to handle the serialization of
lambdas/closures, which is not a built-in feature in Python. Maybe we can
start with a Python API, then extend it to other languages and abstract
the common logic into the portability layer.
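
To make the lambda/closure serialization point concrete, here is a minimal
sketch of the client/worker hand-off, assuming the third-party cloudpickle
library (which PySpark, for example, uses for this purpose); the function
and values are made up for illustration:

    import cloudpickle  # third-party; the stdlib pickle cannot dump lambdas

    # Client side: serialize a user-defined lambda into bytes that could be
    # shipped over the RPC channel alongside the translated StreamGraph.
    double = lambda x: x * 2
    payload = cloudpickle.dumps(double)

    # Server/worker side: restore the function and invoke it at runtime.
    restored = cloudpickle.loads(payload)
    assert restored(21) == 42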

---
References:
[1] [SURVEY] Usage of flink-python and flink-streaming-python

Regards,
Xianda

Re: [DISCUSS] Python (and Non-JVM) Language Support in Flink

Posted by jincheng sun <su...@gmail.com>.
Thanks for your feedback Vino, Jeff!

I have started a new thread outlining what we are proposing for the Python
Table API.

The thread "[DISCUSS] FLIP-38 Support python language in flink TableAPI"
can be found here:

http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-38-Support-python-language-in-flink-TableAPI-td28061.html

Best,
Jincheng

On Fri, Dec 21, 2018 at 1:39 PM Thomas Weise <th...@apache.org> wrote:

> Interest in Python seems on the rise, and so this is a good discussion to
> have :)
>
> So far there seems to be agreement that Beam's approach towards Python and
> other non-JVM language support (language SDK, portability layer, etc.) is
> the right direction? Specification and execution are native Python, and it
> does not suffer from the shortcomings of Flink's Jython API and a few
> other approaches.
>
> Overall there already is good alignment between Beam and Flink in concepts
> and model. There are also a few of us that are active in both communities.
> The Beam Flink runner has made a lot of progress this year, but work on
> portability in Beam actually started much before that and was a very big
> change (originally there was just the Java SDK). Much of the code has been
> rewritten as part of the effort; that's what implementing a strong
> multi-language support story took. To have a decent shot at it, the
> equivalent of much of the Beam portability framework would need to be
> reinvented in Flink. This would fork resources and divert focus away from
> things that may be more core to Flink. As you can guess, I am in favor of
> option (1)!
>
> We could take a look at SQL for reference. The Flink community has
> invested a lot in SQL, and there remains a lot of work to do. The Beam
> community has done the same, and we have two completely separate
> implementations. When I recently learned more about the Beam SQL work, one
> of my first questions was whether a joint effort would not lead to better
> user value. Calcite is common, but isn't there much more that could be
> shared? Someone had the idea that in such a world Flink could just
> substitute portions or all of the graph provided by Beam with its own
> optimized version, but much of the tooling could be the same.
>
> IO connectors are another area where much effort is repeated. It takes a
> very long time to arrive at a solid, production-quality implementation
> (typically resulting from broad user exposure and running at scale).
> Currently there is discussion of how connectors can be done much better in
> both projects: SDF in Beam [1] and FLIP-27.
>
> It's a trade-off, but more synergy would be great!
>
> Thomas
>
> [1]
> https://docs.google.com/document/d/1cKOB9ToasfYs1kLWQgffzvIbJx2Smy4svlodPRhFrk4/

On Tue, Dec 18, 2018 at 2:16 PM Austin Bennett <whatwouldaustindo@gmail.com> wrote:

> Hi Shaoxuan,
>
> FWIW, Kenn Knowles (Beam PMC Chair) recently gave a talk at the Bay Area
> Apache Beam Meetup [1] which included a bit on a vision for how Beam could
> better leverage runner-specific optimizations -- as an example/extension,
> Beam SQL leveraging Flink-specific SQL optimizations (to address your
> point). So that is part of the eventual roadmap for Beam, and it
> illustrates how concrete efforts towards optimizations in the
> Runner/SDK Harness would likely yield the desired result of cross-language
> support (which could be done by leveraging Beam, devoting focus to
> optimizing that processing on Flink).
>
> Cheers,
> Austin
>
> [1] https://www.meetup.com/San-Francisco-Apache-Beam/events/256348972/ --
> I can post/share videos once available should someone desire.

On Fri, Dec 14, 2018 at 6:03 AM Maximilian Michels <mxm@apache.org> wrote:

> Hi Xianda, hi Shaoxuan,
>
> I'd be in favor of option (1). There is great potential in Beam and Flink
> joining forces on this one. Here's why:
>
> The Beam project spent at least a year developing a portability layer,
> with a reasonable number of people working on it. Developing a new
> portability layer from scratch will probably take about the same amount
> of time and resources.
>
> Concerning option (2): There is already a Python API for Flink, but an
> API is only one part of the portability story. In Beam the portability is
> structured into three components:
>
> - SDK (the API, its Protobuf serialization, and interaction with the SDK
>   Harness)
> - Runner (translation from the Protobuf pipeline to a Flink job)
> - SDK Harness (UDF execution, interaction with the SDK and the execution
>   engine)
>
> I could imagine the Flink Python API would be another SDK, which could
> have its own API but would reuse code for the interaction with the SDK
> Harness.
>
> We would be able to focus on the optimizations instead of rebuilding a
> portability layer from scratch.
>
> Thanks,
> Max

On 13.12.18 11:52, Shaoxuan Wang wrote:

> RE: Stephan's options (
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/SURVEY-Usage-of-flink-python-and-flink-streaming-python-td25793.html
> )
> * Option (1): Language portability via Apache Beam
> * Option (2): Implement own Python API
> * Option (3): Implement own portability layer
>
> Hi Stephan,
> Eventually, I think we should support both option 1 and option 3. IMO,
> these two options are orthogonal. I agree with you that we can leverage
> the existing work and ecosystem in Beam by supporting option 1. But the
> problem with Beam is that it skips (to the best of my knowledge) the
> natural Table/SQL optimization framework provided by Flink. We should
> spend all the needed effort to support option 1 (as it is the better
> alternative to the current Flink Python API), but we cannot solely bet on
> it. Option 3 is the ideal choice for Flink to support all non-JVM
> languages, which we should plan to achieve. We have done some preliminary
> prototypes for options 2 and 3, and they seem not overly complex or
> difficult to accomplish.
>
> Regards,
> Shaoxuan

Re: [DISCUSS] Python (and Non-JVM) Language Support in Flink

Posted by Jeff Zhang <zj...@gmail.com>.
Hi Shaoxuan & Jincheng,

Thanks for driving this initiative. Python would be a very big add-on for
Flink adoption in the data science world. One additional suggestion: you
may need to think about how to convert a Flink Table to a pandas
DataFrame, since pandas is a very popular library in Python. You may also
be interested in Apache Arrow, a common layer for transferring data
efficiently across languages: https://arrow.apache.org/
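
As an illustration of the kind of bridge Arrow could provide here, below
is a minimal sketch assuming the pyarrow and pandas packages (the column
data is made up); a real Flink integration would presumably stream Arrow
record batches out of the JVM rather than building the table in Python:

    import pyarrow as pa

    # Columnar data as it might arrive from a JVM process in Arrow format
    # (contents are illustrative).
    table = pa.Table.from_pydict({"user": ["a", "b"], "clicks": [3, 5]})

    # Arrow-to-pandas conversion avoids per-value serialization overhead.
    df = table.to_pandas()
    print(df)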






-- 
Best Regards

Jeff Zhang

Re: [DISCUSS] Python (and Non-JVM) Language Support in Flink

Posted by vino yang <ya...@gmail.com>.
Hi jincheng,

Thanks for reactivating this discussion.
I personally look forward to your design draft.

Best,
Vino


Re: [DISCUSS] Python (and Non-JVM) Language Support in Flink

Posted by jincheng sun <su...@gmail.com>.
Hi everyone,
Sorry for joining this discussion late.

Thanks to Xianda Ke for initiating this discussion. I have also enjoyed
the discussions and suggestions from Max, Austin, Thomas, Shaoxuan, and
others.

Recently, I have really felt the desire of the community and of Flink
users for Python support. Stephan also pointed out in the discussion of
`Adding a mid-term roadmap`: "Table API becomes primary API for analytics
use cases", while a large number of users in analytics use cases are
accustomed to the Python language and have accumulated a large body of
class libraries in the Python ecosystem.

So I am very interested in participating in the discussion of supporting
Python in Flink. With regard to the three options mentioned so far,
leveraging Beam's language portability layer on Flink is very
encouraging. For now, we can start with a first step in Flink: adding a
Python Table API. I believe that in this process we will gain a deeper
understanding of how Flink can support Python. If we can quickly let
users experience a first version of the Flink Python Table API, we can
also receive feedback from many users and consider the long-term goals of
multi-language support on Flink.
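
To make the proposal concrete, here is a purely hypothetical sketch of
what such a Python Table API program might look like; the module path,
the TableEnvironment factory, and the method names are illustrative
assumptions, not a settled design:

    # Hypothetical Python Table API: the Python side only describes the
    # pipeline; the resulting plan would be translated and run in the JVM.
    from flink.table import TableEnvironment  # assumed module layout

    t_env = TableEnvironment.create()
    t = t_env.from_elements([(1, 'a'), (2, 'b'), (3, 'a')], ['id', 'name'])
    t.group_by('name') \
     .select('name, id.count as cnt') \
     .insert_into('result_sink')
    t_env.execute('python-table-demo')
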

So if you agree, I volunteer to draft a document with the detailed design
and implementation plan of the Python Table API on Flink.

What do you think?


Re: [DISCUSS] Python (and Non-JVM) Language Support in Flink

Posted by Shaoxuan Wang <ws...@gmail.com>.
Hey guys,
Thanks for your comments and sorry for the late reply.
The Beam Python API and the Flink Python Table API describe the
DAG/pipeline in different manners. We got a chance to communicate with
Tyler Akidau (from Beam) offline and explained why the Flink Table API
needs a specific design for Python, rather than purely leveraging the Beam
portability layer.

In our proposal, most of the Python code is just a DAG/pipeline builder
for the Table API. The majority of operators run purely in Java; only
Python UDFs are executed in a Python environment at runtime. This design
does not affect the development and adoption of the Beam language
portability layer with the Flink runner. The Flink and Beam communities
will still collaborate, jointly developing and optimizing the JVM /
non-JVM (Python, Go) bridge (the data and control connections between
different processes) to ensure robustness and performance.
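
To make this concrete, here is a minimal sketch of what such a Python
DAG/pipeline builder could look like. All names below (TableEnvironment,
udf, DataTypes, and so on) are illustrative assumptions rather than a
confirmed API; the point is only that the Python side declares the plan,
while the UDF body is the sole user code shipped to a Python worker:

    # Hypothetical sketch, not a confirmed Flink API: every call below
    # only builds the job graph on the client side. The relational
    # operators run in Java; only the body of add_one() executes in a
    # Python worker at runtime.
    from hypothetical_flink_python import TableEnvironment, udf, DataTypes

    @udf(result_type=DataTypes.BIGINT())
    def add_one(x):
        return x + 1

    t_env = TableEnvironment.create()
    t = t_env.from_path("source_table")          # plan node, runs in Java
    t.select(add_one(t.id)).insert_into("sink")  # add_one runs in Python
    t_env.execute("python-udf-demo")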

Regards,
Shaoxuan



Re: [DISCUSS] Python (and Non-JVM) Language Support in Flink

Posted by Thomas Weise <th...@apache.org>.
Interest in Python seems to be on the rise, so this is a good discussion
to have :)

So far there seems to be agreement that Beam's approach towards Python and
other non-JVM language support (language SDK, portability layer, etc.) is
the right direction? Specification and execution are native Python, and it
does not suffer from the shortcomings of Flink's Jython API and a few
other approaches.

Overall there is already good alignment between Beam and Flink in concepts
and model. There are also a few of us who are active in both communities.
The Beam Flink runner has made a lot of progress this year, but work on
portability in Beam actually started much before that and was a very big
change (originally there was just the Java SDK). Much of the code has been
rewritten as part of the effort; that's what implementing a strong
multi-language support story took. To have a decent shot at it, the
equivalent of much of the Beam portability framework would need to be
reinvented in Flink. This would fork resources and divert focus away from
things that may be more core to Flink. As you can guess, I am in favor of
option (1)!

We could take a look at SQL for reference. The Flink community has
invested a lot in SQL and there remains a lot of work to do. The Beam
community has done the same, and we have two completely separate
implementations. When I recently learned more about the Beam SQL work, one
of my first questions was whether a joint effort would not lead to better
user value. Calcite is common, but isn't there much more that could be
shared? Someone had the idea that in such a world Flink could substitute
portions or all of the graph provided by Beam with its own optimized
version, while much of the tooling could stay the same.

IO connectors are another area where much effort is repeated. It takes a
very long time to arrive at a solid, production-quality implementation
(typically resulting from broad user exposure and running at scale).
Currently there is discussion about how connectors can be done much better
in both projects: SDF in Beam [1] and FLIP-27 in Flink.

It's a trade-off, but more synergy would be great!

Thomas

[1]
https://docs.google.com/document/d/1cKOB9ToasfYs1kLWQgffzvIbJx2Smy4svlodPRhFrk4/



Re: [DISCUSS] Python (and Non-JVM) Language Support in Flink

Posted by Austin Bennett <wh...@gmail.com>.
Hi Shaoxuan,

FWIW, Kenn Knowles (Beam PMC Chair) recently gave a talk at the Bay Area
Apache Beam Meetup [1] which included a bit on a vision for how Beam could
better leverage runner-specific optimizations -- as an example/extension,
Beam SQL leveraging Flink-specific SQL optimizations (to address your
point). So that is part of the eventual roadmap for Beam, and it
illustrates how concrete efforts towards optimizations in the
Runner/SDK Harness would likely yield the desired result of cross-language
support (which could be achieved by leveraging Beam, with focus devoted to
optimizing that processing on Flink).

Cheers,
Austin


[1] https://www.meetup.com/San-Francisco-Apache-Beam/events/256348972/ --
I can post/share the videos once they are available, should someone desire.


Re: [DISCUSS] Python (and Non-JVM) Language Support in Flink

Posted by Maximilian Michels <mx...@apache.org>.
Hi Xianda, hi Shaoxuan,

I'd be in favor of option (1). There is great potential in Beam and Flink 
joining forces on this one. Here's why:

The Beam project spent at least a year developing a portability layer,
with a reasonable number of people working on it. Developing a new
portability layer from scratch will probably take about the same amount of
time and resources.

Concerning option (2): There is already a Python API for Flink, but an API
is only one part of the portability story. In Beam, the portability is
structured into three components (a minimal pipeline sketch follows the
list):

- SDK (API, its Protobuf serialization, and interaction with the SDK Harness)
- Runner (translation from the Protobuf pipeline to a Flink job)
- SDK Harness (UDF execution, interaction with the SDK and the execution engine)
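
Here is a minimal Beam Python pipeline targeting the Flink runner, to make
the three components concrete. This is only a sketch: the runner and flag
names follow Beam's portable runner options, and the --flink_master
address is a placeholder, so both should be checked against the Beam
documentation for the version in use.

    # SDK side: this script builds the pipeline and serializes it to
    # Protobuf. The FlinkRunner translates that pipeline into a Flink
    # job, and the lambdas below are executed by the SDK Harness at
    # runtime.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        "--runner=FlinkRunner",
        "--flink_master=localhost:8081",  # placeholder cluster address
    ])
    with beam.Pipeline(options=options) as pipeline:
        (pipeline
         | beam.Create([1, 2, 3])
         | beam.Map(lambda x: x + 1)
         | beam.Map(print))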

I could imagine the Flink Python API being another SDK: it could have its
own API but would reuse code for the interaction with the SDK Harness.

We would be able to focus on the optimizations instead of rebuilding a 
portability layer from scratch.

Thanks,
Max


Re: [DISCUSS] Python (and Non-JVM) Language Support in Flink

Posted by Shaoxuan Wang <ws...@gmail.com>.
RE: Stephan's options (
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/SURVEY-Usage-of-flink-python-and-flink-streaming-python-td25793.html
)
* Option (1): Language portability via Apache Beam
* Option (2): Implement own Python API
* Option (3): Implement own portability layer

Hi Stephan,
Eventually, I think we should support both option 1 and option 3. IMO,
these two options are orthogonal. I agree with you that we can leverage
the existing work and ecosystem in Beam by supporting option 1. But the
problem with Beam is that it bypasses (to the best of my knowledge) the
native Table/SQL optimization framework provided by Flink. We should spend
all the needed effort to support option 1 (as it is the better alternative
to the current Flink Python API), but we cannot solely bet on it. Option 3
is the ideal choice for Flink to support all non-JVM languages, and we
should plan to achieve it. We have done some preliminary prototypes for
options 2 and 3, and they do not seem overly complex or difficult to
accomplish.

Regards,
Shaoxuan

