You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@flink.apache.org by Shuiqiang Chen <ac...@gmail.com> on 2020/07/08 03:13:30 UTC

[DISCUSS] FLIP-130: Support for Python DataStream API (Stateless Part)

Hi everyone,

As we all know, Flink provides three layered APIs: the ProcessFunctions,
the DataStream API and the SQL & Table API. Each API offers a different
trade-off between conciseness and expressiveness and targets different use
cases[1].

Currently, the SQL & Table API has already been supported in PyFlink. The
API provides relational operations as well as user-defined functions to
provide convenience for users who are familiar with python and relational
programming.

Meanwhile, the DataStream API and ProcessFunctions provide more generic
APIs to implement stream processing applications. The ProcessFunctions
expose time and state which are the fundamental building blocks for any
kind of streaming application.
To cover more use cases, we are planning to cover all these APIs in
PyFlink.

In this discussion(FLIP-130), we propose to support the Python DataStream
API for the stateless part. For more detail, please refer to the FLIP wiki
page here[2]. If interested in the stateful part, you can also take a look
the design doc here[3] for which we are going to discuss in a separate FLIP.


Any comments will be highly appreciated!

[1] https://flink.apache.org/flink-applications.html#layered-apis
[2]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866298
[3]
https://docs.google.com/document/d/1H3hz8wuk228cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit?usp=sharing

Best,
Shuiqiang

Re: [DISCUSS] FLIP-130: Support for Python DataStream API (Stateless Part)

Posted by Shuiqiang Chen <ac...@gmail.com>.

Hi blackjact,

We are already working on the design of stateful part, however, it will not be supported in 1.12. We hope to support it in the later releases, e.g. 1.13. Thank you for your attention.

Regards,
Shuiqiang 

> 在 2020年10月20日，下午7:34，blackjjcat <bl...@163.com> 写道：
> 
> May I ask the plan of stateful part, which version is expected to be
> integrated?
> 
> 
> 
> --
> Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/

Re: [DISCUSS] FLIP-130: Support for Python DataStream API (Stateless Part)

Posted by blackjjcat <bl...@163.com>.

May I ask the plan of stateful part, which version is expected to be
integrated?



--
Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/

Re: [DISCUSS] FLIP-130: Support for Python DataStream API (Stateless Part)

Posted by Hequn Cheng <he...@apache.org>.

Hi,

It's a good idea to start with a minimum size of API and add useful ones
when we find it is truly useful.
From my side, I'm also ok with the partitionCustom() method. Thanks David
for your feedback!

Best,
Hequn

On Mon, Jul 27, 2020 at 8:57 PM Aljoscha Krettek <al...@apache.org>
wrote:

> Hi,
>
> I'm also not against adding that if it enables actual use cases. I don't
> think we need to spell out the whole API in the FLIP, though. We can add
> things as they come up.
>
> Best,
> Aljoscha
>
> On 24.07.20 14:43, Shuiqiang Chen wrote:
> > Hi David,
> >
> > Thank you for your reply! I have started the vote for this FLIP, but we
> can
> > keep the discussion on this thread.
> > In my perspective, I would not against adding the
> > DataStream.partitionCustom to Python DataStream API.  However, more
> inputs
> > are welcomed.
> >
> > Best,
> > Shuiqiang
> >
> >
> >
> > David Anderson <da...@alpinegizmo.com> 于2020年7月24日周五 下午7:52写道：
> >
> >> Sorry I'm coming to this rather late, but I would like to argue that
> >> DataStream.partitionCustom enables an important use case.
> >> What I have in mind is performing partitioned enrichment, where each
> >> instance can preload a slice of a static dataset that is being used for
> >> enrichment.
> >>
> >> For an example, consider
> >>
> https://github.com/knaufk/enrichments-with-flink/blob/master/src/main/java/com/github/knaufk/enrichments/CustomPartitionEnrichmenttJob.java
> >> .
> >>
> >> Regards,
> >> David
> >>
> >> On Fri, Jul 24, 2020 at 12:18 PM Shuiqiang Chen <ac...@gmail.com>
> >> wrote:
> >>
> >>> Hi Aljoscha, Thank you for your response.  I'll keep these two helper
> >>> methods in the Python DataStream implementation.
> >>>
> >>> And thank you all for joining in the discussion. It seems that we have
> >>> reached a consensus. I will start a vote for this FLIP later today.
> >>>
> >>> Best,
> >>> Shuiqiang
> >>>
> >>> Hequn Cheng <he...@apache.org> 于2020年7月24日周五 下午5:29写道：
> >>>
> >>>> Thanks a lot for your valuable feedback and suggestions! @Aljoscha
> >>> Krettek
> >>>> <al...@apache.org>
> >>>> +1 to the vote.
> >>>>
> >>>> Best,
> >>>> Hequn
> >>>>
> >>>> On Fri, Jul 24, 2020 at 5:16 PM Aljoscha Krettek <aljoscha@apache.org
> >
> >>>> wrote:
> >>>>
> >>>>> Thanks for updating! And yes, I think it's ok to include the few
> >>> helper
> >>>>> methods such as "readFromFile" and "print".
> >>>>>
> >>>>> I think we can now proceed to a vote! Nice work, overall!
> >>>>>
> >>>>> Best,
> >>>>> Aljoscha
> >>>>>
> >>>>> On 16.07.20 17:16, Hequn Cheng wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> Thanks a lot for your discussions.
> >>>>>> I think Aljoscha makes good suggestions here! Those problematic APIs
> >>>>> should
> >>>>>> not be added to the new Python DataStream API.
> >>>>>>
> >>>>>> Only one item I want to add based on the reply from Shuiqiang:
> >>>>>> I would also tend to keep the readTextFile() method. Apart from
> >>>> print(),
> >>>>>> the readTextFile() may also be very helpful and frequently used for
> >>>>> playing
> >>>>>> with Flink.
> >>>>>> For example, it is used in our WordCount example[1] which is almost
> >>> the
> >>>>>> first Flink program that every beginner runs.
> >>>>>> It is more efficient for reading multi-line data compared to
> >>>>>> fromCollection() meanwhile far more easier to be used compared to
> >>>> Kafka,
> >>>>>> Kinesis, RabbitMQ,etc., in
> >>>>>> cases for playing with Flink.
> >>>>>>
> >>>>>> What do you think?
> >>>>>>
> >>>>>> Best,
> >>>>>> Hequn
> >>>>>>
> >>>>>> [1]
> >>>>>>
> >>>>>
> >>>>
> >>>
> https://github.com/apache/flink/blob/master/flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/wordcount/WordCount.java
> >>>>>>
> >>>>>>
> >>>>>> On Thu, Jul 16, 2020 at 3:37 PM Shuiqiang Chen <acqua.csq@gmail.com
> >>>>
> >>>>> wrote:
> >>>>>>
> >>>>>>> Hi Aljoscha,
> >>>>>>>
> >>>>>>> Thank you for your valuable comments! I agree with you that there
> >>> is
> >>>>> some
> >>>>>>> optimization space for existing API and can be applied to the
> >>> python
> >>>>>>> DataStream API implementation.
> >>>>>>>
> >>>>>>> According to your comments, I have concluded them into the
> >>> following
> >>>>> parts:
> >>>>>>>
> >>>>>>> 1. SingleOutputStreamOperator and DataStreamSource.
> >>>>>>> Yes, the SingleOutputStreamOperator and DataStreamSource are a bit
> >>>>>>> redundant, so we can unify their APIs into DataStream to make it
> >>> more
> >>>>>>> clear.
> >>>>>>>
> >>>>>>> 2. The internal or low-level methods.
> >>>>>>>    - DataStream.get_id(): Has been removed in the FLIP wiki page.
> >>>>>>>    - DataStream.partition_custom(): Has been removed in the FLIP
> >>> wiki
> >>>>> page.
> >>>>>>>    - SingleOutputStreamOperator.can_be_parallel/forceNoParallel:
> Has
> >>>> been
> >>>>>>> removed in the FLIP wiki page.
> >>>>>>> Sorry for mistakenly making those internal methods public, we would
> >>>> not
> >>>>>>> expose them to users in the Python API.
> >>>>>>>
> >>>>>>> 3. "declarative" Apis.
> >>>>>>> - KeyedStream.sum/min/max/min_by/max_by: Has been removed in the
> >>> FLIP
> >>>>> wiki
> >>>>>>> page. They could be well covered by Table API.
> >>>>>>>
> >>>>>>> 4. Spelling problems.
> >>>>>>> - StreamExecutionEnvironment.from_collections. Should be
> >>>>> from_collection().
> >>>>>>> - StreamExecutionEnvironment.generate_sequenece. Should be
> >>>>>>> generate_sequence().
> >>>>>>> Sorry for the spelling error.
> >>>>>>>
> >>>>>>> 5. Predefined source and sink.
> >>>>>>> As you said, most of the predefined sources are not suitable for
> >>>>>>> production, we can ignore them in the new Python DataStream API.
> >>>>>>> There is one exception that maybe I think we should add the print()
> >>>>> since
> >>>>>>> it is commonly used by users and it is very useful for debugging
> >>> jobs.
> >>>>> We
> >>>>>>> can add comments for the API that it should never be used for
> >>>>> production.
> >>>>>>> Meanwhile, as you mentioned, a good alternative that always prints
> >>> on
> >>>>> the
> >>>>>>> client should also be supported. For this case, maybe we can add
> >>> the
> >>>>>>> collect method and return an Iterator. With the iterator, uses can
> >>>> print
> >>>>>>> the content on the client. This is also consistent with the
> >>> behavior
> >>>> in
> >>>>>>> Table API.
> >>>>>>>
> >>>>>>> 6. For Row.
> >>>>>>> Do you mean that we should not expose the Row type in Python API?
> >>>> Maybe
> >>>>> I
> >>>>>>> haven't gotten your concerns well.
> >>>>>>> We can use tuple type in Python DataStream to support Row. (I have
> >>>>> updated
> >>>>>>> the example section of the FLIP to reflect the design.)
> >>>>>>>
> >>>>>>> Highly appreciated for your suggestions again. Looking forward to
> >>> your
> >>>>>>> feedback.
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Shuiqiang
> >>>>>>>
> >>>>>>> Aljoscha Krettek <al...@apache.org> 于2020年7月15日周三 下午5:58写道：
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> thanks for the proposal! I have some comments about the API. We
> >>>> should
> >>>>>>> not
> >>>>>>>> blindly copy the existing Java DataSteam because we made some
> >>>> mistakes
> >>>>>>> with
> >>>>>>>> that and we now have a chance to fix them and not forward them to
> >>> a
> >>>> new
> >>>>>>> API.
> >>>>>>>>
> >>>>>>>> I don't think we need SingleOutputStreamOperator, in the Scala
> >>> API we
> >>>>>>> just
> >>>>>>>> have DataStream and the relevant methods from
> >>>>> SingleOutputStreamOperator
> >>>>>>>> are added to DataStream. Having this extra type is more confusing
> >>>> than
> >>>>>>>> helpful to users, I think. In the same vain, I think we also don't
> >>>> need
> >>>>>>>> DataStreamSource. The source methods can also just return a
> >>>> DataStream.
> >>>>>>>>
> >>>>>>>> There are some methods that I would consider internal and we
> >>>> shouldn't
> >>>>>>>> expose them:
> >>>>>>>>    - DataStream.get_id(): this is an internal method
> >>>>>>>>    - DataStream.partition_custom(): I think adding this method was
> >>> a
> >>>>>>> mistake
> >>>>>>>> because it's to low-level, I could be convinced otherwise
> >>>>>>>>    - DataStream.print()/DataStream.print_to_error(): These are
> >>>>> questionable
> >>>>>>>> because they print to the TaskManager log. Maybe we could add a
> >>> good
> >>>>>>>> alternative that always prints on the client, similar to the Table
> >>>> API
> >>>>>>>>    - DataStream.write_to_socket(): It was a mistake to add this
> >>> sink
> >>>> on
> >>>>>>>> DataStream it is not fault-tolerant and shouldn't be used in
> >>>> production
> >>>>>>>>
> >>>>>>>>    - KeyedStream.sum/min/max/min_by/max_by: Nowadays, the Table
> API
> >>>>> should
> >>>>>>>> be used for "declarative" use cases and I think these methods
> >>> should
> >>>>> not
> >>>>>>> be
> >>>>>>>> in the DataStream API
> >>>>>>>>    - SingleOutputStreamOperator.can_be_parallel/forceNoParallel:
> >>> these
> >>>>> are
> >>>>>>>> internal methods
> >>>>>>>>
> >>>>>>>>    - StreamExecutionEnvironment.from_parallel_collection(): I
> think
> >>>> the
> >>>>>>>> usability is questionable
> >>>>>>>>    - StreamExecutionEnvironment.from_collections -> should be
> >>> called
> >>>>>>>> from_collection
> >>>>>>>>    - StreamExecutionEnvironment.generate_sequenece -> should be
> >>> called
> >>>>>>>> generate_sequence
> >>>>>>>>
> >>>>>>>> I think most of the predefined sources are questionable:
> >>>>>>>>    - fromParallelCollection: I don't know if this is useful
> >>>>>>>>    - readTextFile: most of the variants are not
> >>> useful/fault-tolerant
> >>>>>>>>    - readFile: same
> >>>>>>>>    - socketTextStream: also not useful except for toy examples
> >>>>>>>>    - createInput: also not useful, and it's legacy DataSet
> >>>> InputFormats
> >>>>>>>>
> >>>>>>>> I think we need to think hard whether we want to further expose
> >>> Row
> >>>> in
> >>>>>>> our
> >>>>>>>> APIs. I think adding it to flink-core was more an accident than
> >>>>> anything
> >>>>>>>> else but I can see that it would be useful for Python/Java
> >>> interop.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Aljoscha
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Mon, Jul 13, 2020, at 04:38, jincheng sun wrote:
> >>>>>>>>> Thanks for bring up this DISCUSS Shuiqiang!
> >>>>>>>>>
> >>>>>>>>> +1 for the proposal!
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> Jincheng
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Xingbo Huang <hx...@gmail.com> 于2020年7月9日周四 上午10:41写道：
> >>>>>>>>>
> >>>>>>>>>> Hi Shuiqiang,
> >>>>>>>>>>
> >>>>>>>>>> Thanks a lot for driving this discussion.
> >>>>>>>>>> Big +1 for supporting Python DataStream.
> >>>>>>>>>> In many ML scenarios, operating Object will be more natural than
> >>>>>>>> operating
> >>>>>>>>>> Table.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Xingbo
> >>>>>>>>>>
> >>>>>>>>>> Wei Zhong <we...@gmail.com> 于2020年7月9日周四 上午10:35写道：
> >>>>>>>>>>
> >>>>>>>>>>> Hi Shuiqiang,
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks for driving this. Big +1 for supporting DataStream API
> >>> in
> >>>>>>>> PyFlink!
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Wei
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> 在 2020年7月9日，10:29，Hequn Cheng <he...@apache.org> 写道：
> >>>>>>>>>>>>
> >>>>>>>>>>>> +1 for adding the Python DataStream API and starting with the
> >>>>>>>> stateless
> >>>>>>>>>>>> part.
> >>>>>>>>>>>> There are already some users that expressed their wish to have
> >>>>>>> the
> >>>>>>>>>> Python
> >>>>>>>>>>>> DataStream APIs. Once we have the APIs in PyFlink, we can
> >>> cover
> >>>>>>>> more
> >>>>>>>>>> use
> >>>>>>>>>>>> cases for our users.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best, Hequn
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, Jul 8, 2020 at 11:45 AM Shuiqiang Chen <
> >>>>>>>> acqua.csq@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Sorry, the 3rd link is broken, please refer to this one:
> >>> Support
> >>>>>>>>>> Python
> >>>>>>>>>>>>> DataStream API
> >>>>>>>>>>>>> <
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>
> >>>>
> >>>
> https://docs.google.com/document/d/1H3hz8wuk22-8cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Shuiqiang Chen <ac...@gmail.com> 于2020年7月8日周三 上午11:13写道：
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi everyone,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> As we all know, Flink provides three layered APIs: the
> >>>>>>>>>>> ProcessFunctions,
> >>>>>>>>>>>>>> the DataStream API and the SQL & Table API. Each API offers
> >>> a
> >>>>>>>>>> different
> >>>>>>>>>>>>>> trade-off between conciseness and expressiveness and targets
> >>>>>>>>>> different
> >>>>>>>>>>>>> use
> >>>>>>>>>>>>>> cases[1].
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Currently, the SQL & Table API has already been supported in
> >>>>>>>> PyFlink.
> >>>>>>>>>>> The
> >>>>>>>>>>>>>> API provides relational operations as well as user-defined
> >>>>>>>> functions
> >>>>>>>>>> to
> >>>>>>>>>>>>>> provide convenience for users who are familiar with python
> >>> and
> >>>>>>>>>>> relational
> >>>>>>>>>>>>>> programming.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Meanwhile, the DataStream API and ProcessFunctions provide
> >>> more
> >>>>>>>>>> generic
> >>>>>>>>>>>>>> APIs to implement stream processing applications. The
> >>>>>>>>>> ProcessFunctions
> >>>>>>>>>>>>>> expose time and state which are the fundamental building
> >>> blocks
> >>>>>>>> for
> >>>>>>>>>> any
> >>>>>>>>>>>>>> kind of streaming application.
> >>>>>>>>>>>>>> To cover more use cases, we are planning to cover all these
> >>>>>>> APIs
> >>>>>>>> in
> >>>>>>>>>>>>>> PyFlink.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> In this discussion(FLIP-130), we propose to support the
> >>> Python
> >>>>>>>>>>> DataStream
> >>>>>>>>>>>>>> API for the stateless part. For more detail, please refer to
> >>>>>>> the
> >>>>>>>> FLIP
> >>>>>>>>>>>>> wiki
> >>>>>>>>>>>>>> page here[2]. If interested in the stateful part, you can
> >>> also
> >>>>>>>> take a
> >>>>>>>>>>>>>> look the design doc here[3] for which we are going to
> >>> discuss
> >>>>>>> in
> >>>>>>>> a
> >>>>>>>>>>>>> separate
> >>>>>>>>>>>>>> FLIP.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Any comments will be highly appreciated!
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> [1]
> >>>>>>>> https://flink.apache.org/flink-applications.html#layered-apis
> >>>>>>>>>>>>>> [2]
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>
> >>>>
> >>>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866298
> >>>>>>>>>>>>>> [3]
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>
> >>>>
> >>>
> https://docs.google.com/document/d/1H3hz8wuk228cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit?usp=sharing
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>> Shuiqiang
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>
>

Re: [DISCUSS] FLIP-130: Support for Python DataStream API (Stateless Part)

Posted by Aljoscha Krettek <al...@apache.org>.

Hi,

I'm also not against adding that if it enables actual use cases. I don't 
think we need to spell out the whole API in the FLIP, though. We can add 
things as they come up.

Best,
Aljoscha

On 24.07.20 14:43, Shuiqiang Chen wrote:
> Hi David,
> 
> Thank you for your reply! I have started the vote for this FLIP, but we can
> keep the discussion on this thread.
> In my perspective, I would not against adding the
> DataStream.partitionCustom to Python DataStream API.  However, more inputs
> are welcomed.
> 
> Best,
> Shuiqiang
> 
> 
> 
> David Anderson <da...@alpinegizmo.com> 于2020年7月24日周五 下午7:52写道：
> 
>> Sorry I'm coming to this rather late, but I would like to argue that
>> DataStream.partitionCustom enables an important use case.
>> What I have in mind is performing partitioned enrichment, where each
>> instance can preload a slice of a static dataset that is being used for
>> enrichment.
>>
>> For an example, consider
>> https://github.com/knaufk/enrichments-with-flink/blob/master/src/main/java/com/github/knaufk/enrichments/CustomPartitionEnrichmenttJob.java
>> .
>>
>> Regards,
>> David
>>
>> On Fri, Jul 24, 2020 at 12:18 PM Shuiqiang Chen <ac...@gmail.com>
>> wrote:
>>
>>> Hi Aljoscha, Thank you for your response.  I'll keep these two helper
>>> methods in the Python DataStream implementation.
>>>
>>> And thank you all for joining in the discussion. It seems that we have
>>> reached a consensus. I will start a vote for this FLIP later today.
>>>
>>> Best,
>>> Shuiqiang
>>>
>>> Hequn Cheng <he...@apache.org> 于2020年7月24日周五 下午5:29写道：
>>>
>>>> Thanks a lot for your valuable feedback and suggestions! @Aljoscha
>>> Krettek
>>>> <al...@apache.org>
>>>> +1 to the vote.
>>>>
>>>> Best,
>>>> Hequn
>>>>
>>>> On Fri, Jul 24, 2020 at 5:16 PM Aljoscha Krettek <al...@apache.org>
>>>> wrote:
>>>>
>>>>> Thanks for updating! And yes, I think it's ok to include the few
>>> helper
>>>>> methods such as "readFromFile" and "print".
>>>>>
>>>>> I think we can now proceed to a vote! Nice work, overall!
>>>>>
>>>>> Best,
>>>>> Aljoscha
>>>>>
>>>>> On 16.07.20 17:16, Hequn Cheng wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Thanks a lot for your discussions.
>>>>>> I think Aljoscha makes good suggestions here! Those problematic APIs
>>>>> should
>>>>>> not be added to the new Python DataStream API.
>>>>>>
>>>>>> Only one item I want to add based on the reply from Shuiqiang:
>>>>>> I would also tend to keep the readTextFile() method. Apart from
>>>> print(),
>>>>>> the readTextFile() may also be very helpful and frequently used for
>>>>> playing
>>>>>> with Flink.
>>>>>> For example, it is used in our WordCount example[1] which is almost
>>> the
>>>>>> first Flink program that every beginner runs.
>>>>>> It is more efficient for reading multi-line data compared to
>>>>>> fromCollection() meanwhile far more easier to be used compared to
>>>> Kafka,
>>>>>> Kinesis, RabbitMQ,etc., in
>>>>>> cases for playing with Flink.
>>>>>>
>>>>>> What do you think?
>>>>>>
>>>>>> Best,
>>>>>> Hequn
>>>>>>
>>>>>> [1]
>>>>>>
>>>>>
>>>>
>>> https://github.com/apache/flink/blob/master/flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/wordcount/WordCount.java
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 16, 2020 at 3:37 PM Shuiqiang Chen <acqua.csq@gmail.com
>>>>
>>>>> wrote:
>>>>>>
>>>>>>> Hi Aljoscha,
>>>>>>>
>>>>>>> Thank you for your valuable comments! I agree with you that there
>>> is
>>>>> some
>>>>>>> optimization space for existing API and can be applied to the
>>> python
>>>>>>> DataStream API implementation.
>>>>>>>
>>>>>>> According to your comments, I have concluded them into the
>>> following
>>>>> parts:
>>>>>>>
>>>>>>> 1. SingleOutputStreamOperator and DataStreamSource.
>>>>>>> Yes, the SingleOutputStreamOperator and DataStreamSource are a bit
>>>>>>> redundant, so we can unify their APIs into DataStream to make it
>>> more
>>>>>>> clear.
>>>>>>>
>>>>>>> 2. The internal or low-level methods.
>>>>>>>    - DataStream.get_id(): Has been removed in the FLIP wiki page.
>>>>>>>    - DataStream.partition_custom(): Has been removed in the FLIP
>>> wiki
>>>>> page.
>>>>>>>    - SingleOutputStreamOperator.can_be_parallel/forceNoParallel: Has
>>>> been
>>>>>>> removed in the FLIP wiki page.
>>>>>>> Sorry for mistakenly making those internal methods public, we would
>>>> not
>>>>>>> expose them to users in the Python API.
>>>>>>>
>>>>>>> 3. "declarative" Apis.
>>>>>>> - KeyedStream.sum/min/max/min_by/max_by: Has been removed in the
>>> FLIP
>>>>> wiki
>>>>>>> page. They could be well covered by Table API.
>>>>>>>
>>>>>>> 4. Spelling problems.
>>>>>>> - StreamExecutionEnvironment.from_collections. Should be
>>>>> from_collection().
>>>>>>> - StreamExecutionEnvironment.generate_sequenece. Should be
>>>>>>> generate_sequence().
>>>>>>> Sorry for the spelling error.
>>>>>>>
>>>>>>> 5. Predefined source and sink.
>>>>>>> As you said, most of the predefined sources are not suitable for
>>>>>>> production, we can ignore them in the new Python DataStream API.
>>>>>>> There is one exception that maybe I think we should add the print()
>>>>> since
>>>>>>> it is commonly used by users and it is very useful for debugging
>>> jobs.
>>>>> We
>>>>>>> can add comments for the API that it should never be used for
>>>>> production.
>>>>>>> Meanwhile, as you mentioned, a good alternative that always prints
>>> on
>>>>> the
>>>>>>> client should also be supported. For this case, maybe we can add
>>> the
>>>>>>> collect method and return an Iterator. With the iterator, uses can
>>>> print
>>>>>>> the content on the client. This is also consistent with the
>>> behavior
>>>> in
>>>>>>> Table API.
>>>>>>>
>>>>>>> 6. For Row.
>>>>>>> Do you mean that we should not expose the Row type in Python API?
>>>> Maybe
>>>>> I
>>>>>>> haven't gotten your concerns well.
>>>>>>> We can use tuple type in Python DataStream to support Row. (I have
>>>>> updated
>>>>>>> the example section of the FLIP to reflect the design.)
>>>>>>>
>>>>>>> Highly appreciated for your suggestions again. Looking forward to
>>> your
>>>>>>> feedback.
>>>>>>>
>>>>>>> Best,
>>>>>>> Shuiqiang
>>>>>>>
>>>>>>> Aljoscha Krettek <al...@apache.org> 于2020年7月15日周三 下午5:58写道：
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> thanks for the proposal! I have some comments about the API. We
>>>> should
>>>>>>> not
>>>>>>>> blindly copy the existing Java DataSteam because we made some
>>>> mistakes
>>>>>>> with
>>>>>>>> that and we now have a chance to fix them and not forward them to
>>> a
>>>> new
>>>>>>> API.
>>>>>>>>
>>>>>>>> I don't think we need SingleOutputStreamOperator, in the Scala
>>> API we
>>>>>>> just
>>>>>>>> have DataStream and the relevant methods from
>>>>> SingleOutputStreamOperator
>>>>>>>> are added to DataStream. Having this extra type is more confusing
>>>> than
>>>>>>>> helpful to users, I think. In the same vain, I think we also don't
>>>> need
>>>>>>>> DataStreamSource. The source methods can also just return a
>>>> DataStream.
>>>>>>>>
>>>>>>>> There are some methods that I would consider internal and we
>>>> shouldn't
>>>>>>>> expose them:
>>>>>>>>    - DataStream.get_id(): this is an internal method
>>>>>>>>    - DataStream.partition_custom(): I think adding this method was
>>> a
>>>>>>> mistake
>>>>>>>> because it's to low-level, I could be convinced otherwise
>>>>>>>>    - DataStream.print()/DataStream.print_to_error(): These are
>>>>> questionable
>>>>>>>> because they print to the TaskManager log. Maybe we could add a
>>> good
>>>>>>>> alternative that always prints on the client, similar to the Table
>>>> API
>>>>>>>>    - DataStream.write_to_socket(): It was a mistake to add this
>>> sink
>>>> on
>>>>>>>> DataStream it is not fault-tolerant and shouldn't be used in
>>>> production
>>>>>>>>
>>>>>>>>    - KeyedStream.sum/min/max/min_by/max_by: Nowadays, the Table API
>>>>> should
>>>>>>>> be used for "declarative" use cases and I think these methods
>>> should
>>>>> not
>>>>>>> be
>>>>>>>> in the DataStream API
>>>>>>>>    - SingleOutputStreamOperator.can_be_parallel/forceNoParallel:
>>> these
>>>>> are
>>>>>>>> internal methods
>>>>>>>>
>>>>>>>>    - StreamExecutionEnvironment.from_parallel_collection(): I think
>>>> the
>>>>>>>> usability is questionable
>>>>>>>>    - StreamExecutionEnvironment.from_collections -> should be
>>> called
>>>>>>>> from_collection
>>>>>>>>    - StreamExecutionEnvironment.generate_sequenece -> should be
>>> called
>>>>>>>> generate_sequence
>>>>>>>>
>>>>>>>> I think most of the predefined sources are questionable:
>>>>>>>>    - fromParallelCollection: I don't know if this is useful
>>>>>>>>    - readTextFile: most of the variants are not
>>> useful/fault-tolerant
>>>>>>>>    - readFile: same
>>>>>>>>    - socketTextStream: also not useful except for toy examples
>>>>>>>>    - createInput: also not useful, and it's legacy DataSet
>>>> InputFormats
>>>>>>>>
>>>>>>>> I think we need to think hard whether we want to further expose
>>> Row
>>>> in
>>>>>>> our
>>>>>>>> APIs. I think adding it to flink-core was more an accident than
>>>>> anything
>>>>>>>> else but I can see that it would be useful for Python/Java
>>> interop.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Aljoscha
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Jul 13, 2020, at 04:38, jincheng sun wrote:
>>>>>>>>> Thanks for bring up this DISCUSS Shuiqiang!
>>>>>>>>>
>>>>>>>>> +1 for the proposal!
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Jincheng
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Xingbo Huang <hx...@gmail.com> 于2020年7月9日周四 上午10:41写道：
>>>>>>>>>
>>>>>>>>>> Hi Shuiqiang,
>>>>>>>>>>
>>>>>>>>>> Thanks a lot for driving this discussion.
>>>>>>>>>> Big +1 for supporting Python DataStream.
>>>>>>>>>> In many ML scenarios, operating Object will be more natural than
>>>>>>>> operating
>>>>>>>>>> Table.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Xingbo
>>>>>>>>>>
>>>>>>>>>> Wei Zhong <we...@gmail.com> 于2020年7月9日周四 上午10:35写道：
>>>>>>>>>>
>>>>>>>>>>> Hi Shuiqiang,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for driving this. Big +1 for supporting DataStream API
>>> in
>>>>>>>> PyFlink!
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Wei
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> 在 2020年7月9日，10:29，Hequn Cheng <he...@apache.org> 写道：
>>>>>>>>>>>>
>>>>>>>>>>>> +1 for adding the Python DataStream API and starting with the
>>>>>>>> stateless
>>>>>>>>>>>> part.
>>>>>>>>>>>> There are already some users that expressed their wish to have
>>>>>>> the
>>>>>>>>>> Python
>>>>>>>>>>>> DataStream APIs. Once we have the APIs in PyFlink, we can
>>> cover
>>>>>>>> more
>>>>>>>>>> use
>>>>>>>>>>>> cases for our users.
>>>>>>>>>>>>
>>>>>>>>>>>> Best, Hequn
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jul 8, 2020 at 11:45 AM Shuiqiang Chen <
>>>>>>>> acqua.csq@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Sorry, the 3rd link is broken, please refer to this one:
>>> Support
>>>>>>>>>> Python
>>>>>>>>>>>>> DataStream API
>>>>>>>>>>>>> <
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>> https://docs.google.com/document/d/1H3hz8wuk22-8cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Shuiqiang Chen <ac...@gmail.com> 于2020年7月8日周三 上午11:13写道：
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As we all know, Flink provides three layered APIs: the
>>>>>>>>>>> ProcessFunctions,
>>>>>>>>>>>>>> the DataStream API and the SQL & Table API. Each API offers
>>> a
>>>>>>>>>> different
>>>>>>>>>>>>>> trade-off between conciseness and expressiveness and targets
>>>>>>>>>> different
>>>>>>>>>>>>> use
>>>>>>>>>>>>>> cases[1].
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Currently, the SQL & Table API has already been supported in
>>>>>>>> PyFlink.
>>>>>>>>>>> The
>>>>>>>>>>>>>> API provides relational operations as well as user-defined
>>>>>>>> functions
>>>>>>>>>> to
>>>>>>>>>>>>>> provide convenience for users who are familiar with python
>>> and
>>>>>>>>>>> relational
>>>>>>>>>>>>>> programming.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Meanwhile, the DataStream API and ProcessFunctions provide
>>> more
>>>>>>>>>> generic
>>>>>>>>>>>>>> APIs to implement stream processing applications. The
>>>>>>>>>> ProcessFunctions
>>>>>>>>>>>>>> expose time and state which are the fundamental building
>>> blocks
>>>>>>>> for
>>>>>>>>>> any
>>>>>>>>>>>>>> kind of streaming application.
>>>>>>>>>>>>>> To cover more use cases, we are planning to cover all these
>>>>>>> APIs
>>>>>>>> in
>>>>>>>>>>>>>> PyFlink.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In this discussion(FLIP-130), we propose to support the
>>> Python
>>>>>>>>>>> DataStream
>>>>>>>>>>>>>> API for the stateless part. For more detail, please refer to
>>>>>>> the
>>>>>>>> FLIP
>>>>>>>>>>>>> wiki
>>>>>>>>>>>>>> page here[2]. If interested in the stateful part, you can
>>> also
>>>>>>>> take a
>>>>>>>>>>>>>> look the design doc here[3] for which we are going to
>>> discuss
>>>>>>> in
>>>>>>>> a
>>>>>>>>>>>>> separate
>>>>>>>>>>>>>> FLIP.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Any comments will be highly appreciated!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1]
>>>>>>>> https://flink.apache.org/flink-applications.html#layered-apis
>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866298
>>>>>>>>>>>>>> [3]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>> https://docs.google.com/document/d/1H3hz8wuk228cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit?usp=sharing
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Shuiqiang
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: [DISCUSS] FLIP-130: Support for Python DataStream API (Stateless Part)

Posted by Shuiqiang Chen <ac...@gmail.com>.

Hi David,

Thank you for your reply! I have started the vote for this FLIP, but we can
keep the discussion on this thread.
In my perspective, I would not against adding the
DataStream.partitionCustom to Python DataStream API.  However, more inputs
are welcomed.

Best,
Shuiqiang



David Anderson <da...@alpinegizmo.com> 于2020年7月24日周五 下午7:52写道：

> Sorry I'm coming to this rather late, but I would like to argue that
> DataStream.partitionCustom enables an important use case.
> What I have in mind is performing partitioned enrichment, where each
> instance can preload a slice of a static dataset that is being used for
> enrichment.
>
> For an example, consider
> https://github.com/knaufk/enrichments-with-flink/blob/master/src/main/java/com/github/knaufk/enrichments/CustomPartitionEnrichmenttJob.java
> .
>
> Regards,
> David
>
> On Fri, Jul 24, 2020 at 12:18 PM Shuiqiang Chen <ac...@gmail.com>
> wrote:
>
>> Hi Aljoscha, Thank you for your response.  I'll keep these two helper
>> methods in the Python DataStream implementation.
>>
>> And thank you all for joining in the discussion. It seems that we have
>> reached a consensus. I will start a vote for this FLIP later today.
>>
>> Best,
>> Shuiqiang
>>
>> Hequn Cheng <he...@apache.org> 于2020年7月24日周五 下午5:29写道：
>>
>> > Thanks a lot for your valuable feedback and suggestions! @Aljoscha
>> Krettek
>> > <al...@apache.org>
>> > +1 to the vote.
>> >
>> > Best,
>> > Hequn
>> >
>> > On Fri, Jul 24, 2020 at 5:16 PM Aljoscha Krettek <al...@apache.org>
>> > wrote:
>> >
>> > > Thanks for updating! And yes, I think it's ok to include the few
>> helper
>> > > methods such as "readFromFile" and "print".
>> > >
>> > > I think we can now proceed to a vote! Nice work, overall!
>> > >
>> > > Best,
>> > > Aljoscha
>> > >
>> > > On 16.07.20 17:16, Hequn Cheng wrote:
>> > > > Hi,
>> > > >
>> > > > Thanks a lot for your discussions.
>> > > > I think Aljoscha makes good suggestions here! Those problematic APIs
>> > > should
>> > > > not be added to the new Python DataStream API.
>> > > >
>> > > > Only one item I want to add based on the reply from Shuiqiang:
>> > > > I would also tend to keep the readTextFile() method. Apart from
>> > print(),
>> > > > the readTextFile() may also be very helpful and frequently used for
>> > > playing
>> > > > with Flink.
>> > > > For example, it is used in our WordCount example[1] which is almost
>> the
>> > > > first Flink program that every beginner runs.
>> > > > It is more efficient for reading multi-line data compared to
>> > > > fromCollection() meanwhile far more easier to be used compared to
>> > Kafka,
>> > > > Kinesis, RabbitMQ,etc., in
>> > > > cases for playing with Flink.
>> > > >
>> > > > What do you think?
>> > > >
>> > > > Best,
>> > > > Hequn
>> > > >
>> > > > [1]
>> > > >
>> > >
>> >
>> https://github.com/apache/flink/blob/master/flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/wordcount/WordCount.java
>> > > >
>> > > >
>> > > > On Thu, Jul 16, 2020 at 3:37 PM Shuiqiang Chen <acqua.csq@gmail.com
>> >
>> > > wrote:
>> > > >
>> > > >> Hi Aljoscha,
>> > > >>
>> > > >> Thank you for your valuable comments! I agree with you that there
>> is
>> > > some
>> > > >> optimization space for existing API and can be applied to the
>> python
>> > > >> DataStream API implementation.
>> > > >>
>> > > >> According to your comments, I have concluded them into the
>> following
>> > > parts:
>> > > >>
>> > > >> 1. SingleOutputStreamOperator and DataStreamSource.
>> > > >> Yes, the SingleOutputStreamOperator and DataStreamSource are a bit
>> > > >> redundant, so we can unify their APIs into DataStream to make it
>> more
>> > > >> clear.
>> > > >>
>> > > >> 2. The internal or low-level methods.
>> > > >>   - DataStream.get_id(): Has been removed in the FLIP wiki page.
>> > > >>   - DataStream.partition_custom(): Has been removed in the FLIP
>> wiki
>> > > page.
>> > > >>   - SingleOutputStreamOperator.can_be_parallel/forceNoParallel: Has
>> > been
>> > > >> removed in the FLIP wiki page.
>> > > >> Sorry for mistakenly making those internal methods public, we would
>> > not
>> > > >> expose them to users in the Python API.
>> > > >>
>> > > >> 3. "declarative" Apis.
>> > > >> - KeyedStream.sum/min/max/min_by/max_by: Has been removed in the
>> FLIP
>> > > wiki
>> > > >> page. They could be well covered by Table API.
>> > > >>
>> > > >> 4. Spelling problems.
>> > > >> - StreamExecutionEnvironment.from_collections. Should be
>> > > from_collection().
>> > > >> - StreamExecutionEnvironment.generate_sequenece. Should be
>> > > >> generate_sequence().
>> > > >> Sorry for the spelling error.
>> > > >>
>> > > >> 5. Predefined source and sink.
>> > > >> As you said, most of the predefined sources are not suitable for
>> > > >> production, we can ignore them in the new Python DataStream API.
>> > > >> There is one exception that maybe I think we should add the print()
>> > > since
>> > > >> it is commonly used by users and it is very useful for debugging
>> jobs.
>> > > We
>> > > >> can add comments for the API that it should never be used for
>> > > production.
>> > > >> Meanwhile, as you mentioned, a good alternative that always prints
>> on
>> > > the
>> > > >> client should also be supported. For this case, maybe we can add
>> the
>> > > >> collect method and return an Iterator. With the iterator, uses can
>> > print
>> > > >> the content on the client. This is also consistent with the
>> behavior
>> > in
>> > > >> Table API.
>> > > >>
>> > > >> 6. For Row.
>> > > >> Do you mean that we should not expose the Row type in Python API?
>> > Maybe
>> > > I
>> > > >> haven't gotten your concerns well.
>> > > >> We can use tuple type in Python DataStream to support Row. (I have
>> > > updated
>> > > >> the example section of the FLIP to reflect the design.)
>> > > >>
>> > > >> Highly appreciated for your suggestions again. Looking forward to
>> your
>> > > >> feedback.
>> > > >>
>> > > >> Best,
>> > > >> Shuiqiang
>> > > >>
>> > > >> Aljoscha Krettek <al...@apache.org> 于2020年7月15日周三 下午5:58写道：
>> > > >>
>> > > >>> Hi,
>> > > >>>
>> > > >>> thanks for the proposal! I have some comments about the API. We
>> > should
>> > > >> not
>> > > >>> blindly copy the existing Java DataSteam because we made some
>> > mistakes
>> > > >> with
>> > > >>> that and we now have a chance to fix them and not forward them to
>> a
>> > new
>> > > >> API.
>> > > >>>
>> > > >>> I don't think we need SingleOutputStreamOperator, in the Scala
>> API we
>> > > >> just
>> > > >>> have DataStream and the relevant methods from
>> > > SingleOutputStreamOperator
>> > > >>> are added to DataStream. Having this extra type is more confusing
>> > than
>> > > >>> helpful to users, I think. In the same vain, I think we also don't
>> > need
>> > > >>> DataStreamSource. The source methods can also just return a
>> > DataStream.
>> > > >>>
>> > > >>> There are some methods that I would consider internal and we
>> > shouldn't
>> > > >>> expose them:
>> > > >>>   - DataStream.get_id(): this is an internal method
>> > > >>>   - DataStream.partition_custom(): I think adding this method was
>> a
>> > > >> mistake
>> > > >>> because it's to low-level, I could be convinced otherwise
>> > > >>>   - DataStream.print()/DataStream.print_to_error(): These are
>> > > questionable
>> > > >>> because they print to the TaskManager log. Maybe we could add a
>> good
>> > > >>> alternative that always prints on the client, similar to the Table
>> > API
>> > > >>>   - DataStream.write_to_socket(): It was a mistake to add this
>> sink
>> > on
>> > > >>> DataStream it is not fault-tolerant and shouldn't be used in
>> > production
>> > > >>>
>> > > >>>   - KeyedStream.sum/min/max/min_by/max_by: Nowadays, the Table API
>> > > should
>> > > >>> be used for "declarative" use cases and I think these methods
>> should
>> > > not
>> > > >> be
>> > > >>> in the DataStream API
>> > > >>>   - SingleOutputStreamOperator.can_be_parallel/forceNoParallel:
>> these
>> > > are
>> > > >>> internal methods
>> > > >>>
>> > > >>>   - StreamExecutionEnvironment.from_parallel_collection(): I think
>> > the
>> > > >>> usability is questionable
>> > > >>>   - StreamExecutionEnvironment.from_collections -> should be
>> called
>> > > >>> from_collection
>> > > >>>   - StreamExecutionEnvironment.generate_sequenece -> should be
>> called
>> > > >>> generate_sequence
>> > > >>>
>> > > >>> I think most of the predefined sources are questionable:
>> > > >>>   - fromParallelCollection: I don't know if this is useful
>> > > >>>   - readTextFile: most of the variants are not
>> useful/fault-tolerant
>> > > >>>   - readFile: same
>> > > >>>   - socketTextStream: also not useful except for toy examples
>> > > >>>   - createInput: also not useful, and it's legacy DataSet
>> > InputFormats
>> > > >>>
>> > > >>> I think we need to think hard whether we want to further expose
>> Row
>> > in
>> > > >> our
>> > > >>> APIs. I think adding it to flink-core was more an accident than
>> > > anything
>> > > >>> else but I can see that it would be useful for Python/Java
>> interop.
>> > > >>>
>> > > >>> Best,
>> > > >>> Aljoscha
>> > > >>>
>> > > >>>
>> > > >>> On Mon, Jul 13, 2020, at 04:38, jincheng sun wrote:
>> > > >>>> Thanks for bring up this DISCUSS Shuiqiang!
>> > > >>>>
>> > > >>>> +1 for the proposal!
>> > > >>>>
>> > > >>>> Best,
>> > > >>>> Jincheng
>> > > >>>>
>> > > >>>>
>> > > >>>> Xingbo Huang <hx...@gmail.com> 于2020年7月9日周四 上午10:41写道：
>> > > >>>>
>> > > >>>>> Hi Shuiqiang,
>> > > >>>>>
>> > > >>>>> Thanks a lot for driving this discussion.
>> > > >>>>> Big +1 for supporting Python DataStream.
>> > > >>>>> In many ML scenarios, operating Object will be more natural than
>> > > >>> operating
>> > > >>>>> Table.
>> > > >>>>>
>> > > >>>>> Best,
>> > > >>>>> Xingbo
>> > > >>>>>
>> > > >>>>> Wei Zhong <we...@gmail.com> 于2020年7月9日周四 上午10:35写道：
>> > > >>>>>
>> > > >>>>>> Hi Shuiqiang,
>> > > >>>>>>
>> > > >>>>>> Thanks for driving this. Big +1 for supporting DataStream API
>> in
>> > > >>> PyFlink!
>> > > >>>>>>
>> > > >>>>>> Best,
>> > > >>>>>> Wei
>> > > >>>>>>
>> > > >>>>>>
>> > > >>>>>>> 在 2020年7月9日，10:29，Hequn Cheng <he...@apache.org> 写道：
>> > > >>>>>>>
>> > > >>>>>>> +1 for adding the Python DataStream API and starting with the
>> > > >>> stateless
>> > > >>>>>>> part.
>> > > >>>>>>> There are already some users that expressed their wish to have
>> > > >> the
>> > > >>>>> Python
>> > > >>>>>>> DataStream APIs. Once we have the APIs in PyFlink, we can
>> cover
>> > > >>> more
>> > > >>>>> use
>> > > >>>>>>> cases for our users.
>> > > >>>>>>>
>> > > >>>>>>> Best, Hequn
>> > > >>>>>>>
>> > > >>>>>>> On Wed, Jul 8, 2020 at 11:45 AM Shuiqiang Chen <
>> > > >>> acqua.csq@gmail.com>
>> > > >>>>>> wrote:
>> > > >>>>>>>
>> > > >>>>>>>> Sorry, the 3rd link is broken, please refer to this one:
>> Support
>> > > >>>>> Python
>> > > >>>>>>>> DataStream API
>> > > >>>>>>>> <
>> > > >>>>>>>>
>> > > >>>>>>
>> > > >>>>>
>> > > >>>
>> > > >>
>> > >
>> >
>> https://docs.google.com/document/d/1H3hz8wuk22-8cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit
>> > > >>>>>>>>>
>> > > >>>>>>>>
>> > > >>>>>>>> Shuiqiang Chen <ac...@gmail.com> 于2020年7月8日周三 上午11:13写道：
>> > > >>>>>>>>
>> > > >>>>>>>>> Hi everyone,
>> > > >>>>>>>>>
>> > > >>>>>>>>> As we all know, Flink provides three layered APIs: the
>> > > >>>>>> ProcessFunctions,
>> > > >>>>>>>>> the DataStream API and the SQL & Table API. Each API offers
>> a
>> > > >>>>> different
>> > > >>>>>>>>> trade-off between conciseness and expressiveness and targets
>> > > >>>>> different
>> > > >>>>>>>> use
>> > > >>>>>>>>> cases[1].
>> > > >>>>>>>>>
>> > > >>>>>>>>> Currently, the SQL & Table API has already been supported in
>> > > >>> PyFlink.
>> > > >>>>>> The
>> > > >>>>>>>>> API provides relational operations as well as user-defined
>> > > >>> functions
>> > > >>>>> to
>> > > >>>>>>>>> provide convenience for users who are familiar with python
>> and
>> > > >>>>>> relational
>> > > >>>>>>>>> programming.
>> > > >>>>>>>>>
>> > > >>>>>>>>> Meanwhile, the DataStream API and ProcessFunctions provide
>> more
>> > > >>>>> generic
>> > > >>>>>>>>> APIs to implement stream processing applications. The
>> > > >>>>> ProcessFunctions
>> > > >>>>>>>>> expose time and state which are the fundamental building
>> blocks
>> > > >>> for
>> > > >>>>> any
>> > > >>>>>>>>> kind of streaming application.
>> > > >>>>>>>>> To cover more use cases, we are planning to cover all these
>> > > >> APIs
>> > > >>> in
>> > > >>>>>>>>> PyFlink.
>> > > >>>>>>>>>
>> > > >>>>>>>>> In this discussion(FLIP-130), we propose to support the
>> Python
>> > > >>>>>> DataStream
>> > > >>>>>>>>> API for the stateless part. For more detail, please refer to
>> > > >> the
>> > > >>> FLIP
>> > > >>>>>>>> wiki
>> > > >>>>>>>>> page here[2]. If interested in the stateful part, you can
>> also
>> > > >>> take a
>> > > >>>>>>>>> look the design doc here[3] for which we are going to
>> discuss
>> > > >> in
>> > > >>> a
>> > > >>>>>>>> separate
>> > > >>>>>>>>> FLIP.
>> > > >>>>>>>>>
>> > > >>>>>>>>> Any comments will be highly appreciated!
>> > > >>>>>>>>>
>> > > >>>>>>>>> [1]
>> > > >>> https://flink.apache.org/flink-applications.html#layered-apis
>> > > >>>>>>>>> [2]
>> > > >>>>>>>>>
>> > > >>>>>>>>
>> > > >>>>>>
>> > > >>>>>
>> > > >>>
>> > > >>
>> > >
>> >
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866298
>> > > >>>>>>>>> [3]
>> > > >>>>>>>>>
>> > > >>>>>>>>
>> > > >>>>>>
>> > > >>>>>
>> > > >>>
>> > > >>
>> > >
>> >
>> https://docs.google.com/document/d/1H3hz8wuk228cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit?usp=sharing
>> > > >>>>>>>>>
>> > > >>>>>>>>> Best,
>> > > >>>>>>>>> Shuiqiang
>> > > >>>>>>>>>
>> > > >>>>>>>>>
>> > > >>>>>>>>>
>> > > >>>>>>>>>
>> > > >>>>>>>>
>> > > >>>>>>
>> > > >>>>>>
>> > > >>>>>
>> > > >>>>
>> > > >>>
>> > > >>
>> > > >
>> > >
>> > >
>> >
>>
>

Re: [DISCUSS] FLIP-130: Support for Python DataStream API (Stateless Part)

Posted by David Anderson <da...@alpinegizmo.com>.

Sorry I'm coming to this rather late, but I would like to argue that
DataStream.partitionCustom enables an important use case.
What I have in mind is performing partitioned enrichment, where each
instance can preload a slice of a static dataset that is being used for
enrichment.

For an example, consider
https://github.com/knaufk/enrichments-with-flink/blob/master/src/main/java/com/github/knaufk/enrichments/CustomPartitionEnrichmenttJob.java
.

Regards,
David

On Fri, Jul 24, 2020 at 12:18 PM Shuiqiang Chen <ac...@gmail.com> wrote:

> Hi Aljoscha, Thank you for your response.  I'll keep these two helper
> methods in the Python DataStream implementation.
>
> And thank you all for joining in the discussion. It seems that we have
> reached a consensus. I will start a vote for this FLIP later today.
>
> Best,
> Shuiqiang
>
> Hequn Cheng <he...@apache.org> 于2020年7月24日周五 下午5:29写道：
>
> > Thanks a lot for your valuable feedback and suggestions! @Aljoscha
> Krettek
> > <al...@apache.org>
> > +1 to the vote.
> >
> > Best,
> > Hequn
> >
> > On Fri, Jul 24, 2020 at 5:16 PM Aljoscha Krettek <al...@apache.org>
> > wrote:
> >
> > > Thanks for updating! And yes, I think it's ok to include the few helper
> > > methods such as "readFromFile" and "print".
> > >
> > > I think we can now proceed to a vote! Nice work, overall!
> > >
> > > Best,
> > > Aljoscha
> > >
> > > On 16.07.20 17:16, Hequn Cheng wrote:
> > > > Hi,
> > > >
> > > > Thanks a lot for your discussions.
> > > > I think Aljoscha makes good suggestions here! Those problematic APIs
> > > should
> > > > not be added to the new Python DataStream API.
> > > >
> > > > Only one item I want to add based on the reply from Shuiqiang:
> > > > I would also tend to keep the readTextFile() method. Apart from
> > print(),
> > > > the readTextFile() may also be very helpful and frequently used for
> > > playing
> > > > with Flink.
> > > > For example, it is used in our WordCount example[1] which is almost
> the
> > > > first Flink program that every beginner runs.
> > > > It is more efficient for reading multi-line data compared to
> > > > fromCollection() meanwhile far more easier to be used compared to
> > Kafka,
> > > > Kinesis, RabbitMQ,etc., in
> > > > cases for playing with Flink.
> > > >
> > > > What do you think?
> > > >
> > > > Best,
> > > > Hequn
> > > >
> > > > [1]
> > > >
> > >
> >
> https://github.com/apache/flink/blob/master/flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/wordcount/WordCount.java
> > > >
> > > >
> > > > On Thu, Jul 16, 2020 at 3:37 PM Shuiqiang Chen <ac...@gmail.com>
> > > wrote:
> > > >
> > > >> Hi Aljoscha,
> > > >>
> > > >> Thank you for your valuable comments! I agree with you that there is
> > > some
> > > >> optimization space for existing API and can be applied to the python
> > > >> DataStream API implementation.
> > > >>
> > > >> According to your comments, I have concluded them into the following
> > > parts:
> > > >>
> > > >> 1. SingleOutputStreamOperator and DataStreamSource.
> > > >> Yes, the SingleOutputStreamOperator and DataStreamSource are a bit
> > > >> redundant, so we can unify their APIs into DataStream to make it
> more
> > > >> clear.
> > > >>
> > > >> 2. The internal or low-level methods.
> > > >>   - DataStream.get_id(): Has been removed in the FLIP wiki page.
> > > >>   - DataStream.partition_custom(): Has been removed in the FLIP wiki
> > > page.
> > > >>   - SingleOutputStreamOperator.can_be_parallel/forceNoParallel: Has
> > been
> > > >> removed in the FLIP wiki page.
> > > >> Sorry for mistakenly making those internal methods public, we would
> > not
> > > >> expose them to users in the Python API.
> > > >>
> > > >> 3. "declarative" Apis.
> > > >> - KeyedStream.sum/min/max/min_by/max_by: Has been removed in the
> FLIP
> > > wiki
> > > >> page. They could be well covered by Table API.
> > > >>
> > > >> 4. Spelling problems.
> > > >> - StreamExecutionEnvironment.from_collections. Should be
> > > from_collection().
> > > >> - StreamExecutionEnvironment.generate_sequenece. Should be
> > > >> generate_sequence().
> > > >> Sorry for the spelling error.
> > > >>
> > > >> 5. Predefined source and sink.
> > > >> As you said, most of the predefined sources are not suitable for
> > > >> production, we can ignore them in the new Python DataStream API.
> > > >> There is one exception that maybe I think we should add the print()
> > > since
> > > >> it is commonly used by users and it is very useful for debugging
> jobs.
> > > We
> > > >> can add comments for the API that it should never be used for
> > > production.
> > > >> Meanwhile, as you mentioned, a good alternative that always prints
> on
> > > the
> > > >> client should also be supported. For this case, maybe we can add the
> > > >> collect method and return an Iterator. With the iterator, uses can
> > print
> > > >> the content on the client. This is also consistent with the behavior
> > in
> > > >> Table API.
> > > >>
> > > >> 6. For Row.
> > > >> Do you mean that we should not expose the Row type in Python API?
> > Maybe
> > > I
> > > >> haven't gotten your concerns well.
> > > >> We can use tuple type in Python DataStream to support Row. (I have
> > > updated
> > > >> the example section of the FLIP to reflect the design.)
> > > >>
> > > >> Highly appreciated for your suggestions again. Looking forward to
> your
> > > >> feedback.
> > > >>
> > > >> Best,
> > > >> Shuiqiang
> > > >>
> > > >> Aljoscha Krettek <al...@apache.org> 于2020年7月15日周三 下午5:58写道：
> > > >>
> > > >>> Hi,
> > > >>>
> > > >>> thanks for the proposal! I have some comments about the API. We
> > should
> > > >> not
> > > >>> blindly copy the existing Java DataSteam because we made some
> > mistakes
> > > >> with
> > > >>> that and we now have a chance to fix them and not forward them to a
> > new
> > > >> API.
> > > >>>
> > > >>> I don't think we need SingleOutputStreamOperator, in the Scala API
> we
> > > >> just
> > > >>> have DataStream and the relevant methods from
> > > SingleOutputStreamOperator
> > > >>> are added to DataStream. Having this extra type is more confusing
> > than
> > > >>> helpful to users, I think. In the same vain, I think we also don't
> > need
> > > >>> DataStreamSource. The source methods can also just return a
> > DataStream.
> > > >>>
> > > >>> There are some methods that I would consider internal and we
> > shouldn't
> > > >>> expose them:
> > > >>>   - DataStream.get_id(): this is an internal method
> > > >>>   - DataStream.partition_custom(): I think adding this method was a
> > > >> mistake
> > > >>> because it's to low-level, I could be convinced otherwise
> > > >>>   - DataStream.print()/DataStream.print_to_error(): These are
> > > questionable
> > > >>> because they print to the TaskManager log. Maybe we could add a
> good
> > > >>> alternative that always prints on the client, similar to the Table
> > API
> > > >>>   - DataStream.write_to_socket(): It was a mistake to add this sink
> > on
> > > >>> DataStream it is not fault-tolerant and shouldn't be used in
> > production
> > > >>>
> > > >>>   - KeyedStream.sum/min/max/min_by/max_by: Nowadays, the Table API
> > > should
> > > >>> be used for "declarative" use cases and I think these methods
> should
> > > not
> > > >> be
> > > >>> in the DataStream API
> > > >>>   - SingleOutputStreamOperator.can_be_parallel/forceNoParallel:
> these
> > > are
> > > >>> internal methods
> > > >>>
> > > >>>   - StreamExecutionEnvironment.from_parallel_collection(): I think
> > the
> > > >>> usability is questionable
> > > >>>   - StreamExecutionEnvironment.from_collections -> should be called
> > > >>> from_collection
> > > >>>   - StreamExecutionEnvironment.generate_sequenece -> should be
> called
> > > >>> generate_sequence
> > > >>>
> > > >>> I think most of the predefined sources are questionable:
> > > >>>   - fromParallelCollection: I don't know if this is useful
> > > >>>   - readTextFile: most of the variants are not
> useful/fault-tolerant
> > > >>>   - readFile: same
> > > >>>   - socketTextStream: also not useful except for toy examples
> > > >>>   - createInput: also not useful, and it's legacy DataSet
> > InputFormats
> > > >>>
> > > >>> I think we need to think hard whether we want to further expose Row
> > in
> > > >> our
> > > >>> APIs. I think adding it to flink-core was more an accident than
> > > anything
> > > >>> else but I can see that it would be useful for Python/Java interop.
> > > >>>
> > > >>> Best,
> > > >>> Aljoscha
> > > >>>
> > > >>>
> > > >>> On Mon, Jul 13, 2020, at 04:38, jincheng sun wrote:
> > > >>>> Thanks for bring up this DISCUSS Shuiqiang!
> > > >>>>
> > > >>>> +1 for the proposal!
> > > >>>>
> > > >>>> Best,
> > > >>>> Jincheng
> > > >>>>
> > > >>>>
> > > >>>> Xingbo Huang <hx...@gmail.com> 于2020年7月9日周四 上午10:41写道：
> > > >>>>
> > > >>>>> Hi Shuiqiang,
> > > >>>>>
> > > >>>>> Thanks a lot for driving this discussion.
> > > >>>>> Big +1 for supporting Python DataStream.
> > > >>>>> In many ML scenarios, operating Object will be more natural than
> > > >>> operating
> > > >>>>> Table.
> > > >>>>>
> > > >>>>> Best,
> > > >>>>> Xingbo
> > > >>>>>
> > > >>>>> Wei Zhong <we...@gmail.com> 于2020年7月9日周四 上午10:35写道：
> > > >>>>>
> > > >>>>>> Hi Shuiqiang,
> > > >>>>>>
> > > >>>>>> Thanks for driving this. Big +1 for supporting DataStream API in
> > > >>> PyFlink!
> > > >>>>>>
> > > >>>>>> Best,
> > > >>>>>> Wei
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>> 在 2020年7月9日，10:29，Hequn Cheng <he...@apache.org> 写道：
> > > >>>>>>>
> > > >>>>>>> +1 for adding the Python DataStream API and starting with the
> > > >>> stateless
> > > >>>>>>> part.
> > > >>>>>>> There are already some users that expressed their wish to have
> > > >> the
> > > >>>>> Python
> > > >>>>>>> DataStream APIs. Once we have the APIs in PyFlink, we can cover
> > > >>> more
> > > >>>>> use
> > > >>>>>>> cases for our users.
> > > >>>>>>>
> > > >>>>>>> Best, Hequn
> > > >>>>>>>
> > > >>>>>>> On Wed, Jul 8, 2020 at 11:45 AM Shuiqiang Chen <
> > > >>> acqua.csq@gmail.com>
> > > >>>>>> wrote:
> > > >>>>>>>
> > > >>>>>>>> Sorry, the 3rd link is broken, please refer to this one:
> Support
> > > >>>>> Python
> > > >>>>>>>> DataStream API
> > > >>>>>>>> <
> > > >>>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>
> > > >>
> > >
> >
> https://docs.google.com/document/d/1H3hz8wuk22-8cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit
> > > >>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> Shuiqiang Chen <ac...@gmail.com> 于2020年7月8日周三 上午11:13写道：
> > > >>>>>>>>
> > > >>>>>>>>> Hi everyone,
> > > >>>>>>>>>
> > > >>>>>>>>> As we all know, Flink provides three layered APIs: the
> > > >>>>>> ProcessFunctions,
> > > >>>>>>>>> the DataStream API and the SQL & Table API. Each API offers a
> > > >>>>> different
> > > >>>>>>>>> trade-off between conciseness and expressiveness and targets
> > > >>>>> different
> > > >>>>>>>> use
> > > >>>>>>>>> cases[1].
> > > >>>>>>>>>
> > > >>>>>>>>> Currently, the SQL & Table API has already been supported in
> > > >>> PyFlink.
> > > >>>>>> The
> > > >>>>>>>>> API provides relational operations as well as user-defined
> > > >>> functions
> > > >>>>> to
> > > >>>>>>>>> provide convenience for users who are familiar with python
> and
> > > >>>>>> relational
> > > >>>>>>>>> programming.
> > > >>>>>>>>>
> > > >>>>>>>>> Meanwhile, the DataStream API and ProcessFunctions provide
> more
> > > >>>>> generic
> > > >>>>>>>>> APIs to implement stream processing applications. The
> > > >>>>> ProcessFunctions
> > > >>>>>>>>> expose time and state which are the fundamental building
> blocks
> > > >>> for
> > > >>>>> any
> > > >>>>>>>>> kind of streaming application.
> > > >>>>>>>>> To cover more use cases, we are planning to cover all these
> > > >> APIs
> > > >>> in
> > > >>>>>>>>> PyFlink.
> > > >>>>>>>>>
> > > >>>>>>>>> In this discussion(FLIP-130), we propose to support the
> Python
> > > >>>>>> DataStream
> > > >>>>>>>>> API for the stateless part. For more detail, please refer to
> > > >> the
> > > >>> FLIP
> > > >>>>>>>> wiki
> > > >>>>>>>>> page here[2]. If interested in the stateful part, you can
> also
> > > >>> take a
> > > >>>>>>>>> look the design doc here[3] for which we are going to discuss
> > > >> in
> > > >>> a
> > > >>>>>>>> separate
> > > >>>>>>>>> FLIP.
> > > >>>>>>>>>
> > > >>>>>>>>> Any comments will be highly appreciated!
> > > >>>>>>>>>
> > > >>>>>>>>> [1]
> > > >>> https://flink.apache.org/flink-applications.html#layered-apis
> > > >>>>>>>>> [2]
> > > >>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866298
> > > >>>>>>>>> [3]
> > > >>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>
> > > >>
> > >
> >
> https://docs.google.com/document/d/1H3hz8wuk228cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit?usp=sharing
> > > >>>>>>>>>
> > > >>>>>>>>> Best,
> > > >>>>>>>>> Shuiqiang
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>>
> > > >>>
> > > >>
> > > >
> > >
> > >
> >
>

Re: [DISCUSS] FLIP-130: Support for Python DataStream API (Stateless Part)

Posted by Shuiqiang Chen <ac...@gmail.com>.

Hi Aljoscha, Thank you for your response.  I'll keep these two helper
methods in the Python DataStream implementation.

And thank you all for joining in the discussion. It seems that we have
reached a consensus. I will start a vote for this FLIP later today.

Best,
Shuiqiang

Hequn Cheng <he...@apache.org> 于2020年7月24日周五 下午5:29写道：

> Thanks a lot for your valuable feedback and suggestions! @Aljoscha Krettek
> <al...@apache.org>
> +1 to the vote.
>
> Best,
> Hequn
>
> On Fri, Jul 24, 2020 at 5:16 PM Aljoscha Krettek <al...@apache.org>
> wrote:
>
> > Thanks for updating! And yes, I think it's ok to include the few helper
> > methods such as "readFromFile" and "print".
> >
> > I think we can now proceed to a vote! Nice work, overall!
> >
> > Best,
> > Aljoscha
> >
> > On 16.07.20 17:16, Hequn Cheng wrote:
> > > Hi,
> > >
> > > Thanks a lot for your discussions.
> > > I think Aljoscha makes good suggestions here! Those problematic APIs
> > should
> > > not be added to the new Python DataStream API.
> > >
> > > Only one item I want to add based on the reply from Shuiqiang:
> > > I would also tend to keep the readTextFile() method. Apart from
> print(),
> > > the readTextFile() may also be very helpful and frequently used for
> > playing
> > > with Flink.
> > > For example, it is used in our WordCount example[1] which is almost the
> > > first Flink program that every beginner runs.
> > > It is more efficient for reading multi-line data compared to
> > > fromCollection() meanwhile far more easier to be used compared to
> Kafka,
> > > Kinesis, RabbitMQ,etc., in
> > > cases for playing with Flink.
> > >
> > > What do you think?
> > >
> > > Best,
> > > Hequn
> > >
> > > [1]
> > >
> >
> https://github.com/apache/flink/blob/master/flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/wordcount/WordCount.java
> > >
> > >
> > > On Thu, Jul 16, 2020 at 3:37 PM Shuiqiang Chen <ac...@gmail.com>
> > wrote:
> > >
> > >> Hi Aljoscha,
> > >>
> > >> Thank you for your valuable comments! I agree with you that there is
> > some
> > >> optimization space for existing API and can be applied to the python
> > >> DataStream API implementation.
> > >>
> > >> According to your comments, I have concluded them into the following
> > parts:
> > >>
> > >> 1. SingleOutputStreamOperator and DataStreamSource.
> > >> Yes, the SingleOutputStreamOperator and DataStreamSource are a bit
> > >> redundant, so we can unify their APIs into DataStream to make it more
> > >> clear.
> > >>
> > >> 2. The internal or low-level methods.
> > >>   - DataStream.get_id(): Has been removed in the FLIP wiki page.
> > >>   - DataStream.partition_custom(): Has been removed in the FLIP wiki
> > page.
> > >>   - SingleOutputStreamOperator.can_be_parallel/forceNoParallel: Has
> been
> > >> removed in the FLIP wiki page.
> > >> Sorry for mistakenly making those internal methods public, we would
> not
> > >> expose them to users in the Python API.
> > >>
> > >> 3. "declarative" Apis.
> > >> - KeyedStream.sum/min/max/min_by/max_by: Has been removed in the FLIP
> > wiki
> > >> page. They could be well covered by Table API.
> > >>
> > >> 4. Spelling problems.
> > >> - StreamExecutionEnvironment.from_collections. Should be
> > from_collection().
> > >> - StreamExecutionEnvironment.generate_sequenece. Should be
> > >> generate_sequence().
> > >> Sorry for the spelling error.
> > >>
> > >> 5. Predefined source and sink.
> > >> As you said, most of the predefined sources are not suitable for
> > >> production, we can ignore them in the new Python DataStream API.
> > >> There is one exception that maybe I think we should add the print()
> > since
> > >> it is commonly used by users and it is very useful for debugging jobs.
> > We
> > >> can add comments for the API that it should never be used for
> > production.
> > >> Meanwhile, as you mentioned, a good alternative that always prints on
> > the
> > >> client should also be supported. For this case, maybe we can add the
> > >> collect method and return an Iterator. With the iterator, uses can
> print
> > >> the content on the client. This is also consistent with the behavior
> in
> > >> Table API.
> > >>
> > >> 6. For Row.
> > >> Do you mean that we should not expose the Row type in Python API?
> Maybe
> > I
> > >> haven't gotten your concerns well.
> > >> We can use tuple type in Python DataStream to support Row. (I have
> > updated
> > >> the example section of the FLIP to reflect the design.)
> > >>
> > >> Highly appreciated for your suggestions again. Looking forward to your
> > >> feedback.
> > >>
> > >> Best,
> > >> Shuiqiang
> > >>
> > >> Aljoscha Krettek <al...@apache.org> 于2020年7月15日周三 下午5:58写道：
> > >>
> > >>> Hi,
> > >>>
> > >>> thanks for the proposal! I have some comments about the API. We
> should
> > >> not
> > >>> blindly copy the existing Java DataSteam because we made some
> mistakes
> > >> with
> > >>> that and we now have a chance to fix them and not forward them to a
> new
> > >> API.
> > >>>
> > >>> I don't think we need SingleOutputStreamOperator, in the Scala API we
> > >> just
> > >>> have DataStream and the relevant methods from
> > SingleOutputStreamOperator
> > >>> are added to DataStream. Having this extra type is more confusing
> than
> > >>> helpful to users, I think. In the same vain, I think we also don't
> need
> > >>> DataStreamSource. The source methods can also just return a
> DataStream.
> > >>>
> > >>> There are some methods that I would consider internal and we
> shouldn't
> > >>> expose them:
> > >>>   - DataStream.get_id(): this is an internal method
> > >>>   - DataStream.partition_custom(): I think adding this method was a
> > >> mistake
> > >>> because it's to low-level, I could be convinced otherwise
> > >>>   - DataStream.print()/DataStream.print_to_error(): These are
> > questionable
> > >>> because they print to the TaskManager log. Maybe we could add a good
> > >>> alternative that always prints on the client, similar to the Table
> API
> > >>>   - DataStream.write_to_socket(): It was a mistake to add this sink
> on
> > >>> DataStream it is not fault-tolerant and shouldn't be used in
> production
> > >>>
> > >>>   - KeyedStream.sum/min/max/min_by/max_by: Nowadays, the Table API
> > should
> > >>> be used for "declarative" use cases and I think these methods should
> > not
> > >> be
> > >>> in the DataStream API
> > >>>   - SingleOutputStreamOperator.can_be_parallel/forceNoParallel: these
> > are
> > >>> internal methods
> > >>>
> > >>>   - StreamExecutionEnvironment.from_parallel_collection(): I think
> the
> > >>> usability is questionable
> > >>>   - StreamExecutionEnvironment.from_collections -> should be called
> > >>> from_collection
> > >>>   - StreamExecutionEnvironment.generate_sequenece -> should be called
> > >>> generate_sequence
> > >>>
> > >>> I think most of the predefined sources are questionable:
> > >>>   - fromParallelCollection: I don't know if this is useful
> > >>>   - readTextFile: most of the variants are not useful/fault-tolerant
> > >>>   - readFile: same
> > >>>   - socketTextStream: also not useful except for toy examples
> > >>>   - createInput: also not useful, and it's legacy DataSet
> InputFormats
> > >>>
> > >>> I think we need to think hard whether we want to further expose Row
> in
> > >> our
> > >>> APIs. I think adding it to flink-core was more an accident than
> > anything
> > >>> else but I can see that it would be useful for Python/Java interop.
> > >>>
> > >>> Best,
> > >>> Aljoscha
> > >>>
> > >>>
> > >>> On Mon, Jul 13, 2020, at 04:38, jincheng sun wrote:
> > >>>> Thanks for bring up this DISCUSS Shuiqiang!
> > >>>>
> > >>>> +1 for the proposal!
> > >>>>
> > >>>> Best,
> > >>>> Jincheng
> > >>>>
> > >>>>
> > >>>> Xingbo Huang <hx...@gmail.com> 于2020年7月9日周四 上午10:41写道：
> > >>>>
> > >>>>> Hi Shuiqiang,
> > >>>>>
> > >>>>> Thanks a lot for driving this discussion.
> > >>>>> Big +1 for supporting Python DataStream.
> > >>>>> In many ML scenarios, operating Object will be more natural than
> > >>> operating
> > >>>>> Table.
> > >>>>>
> > >>>>> Best,
> > >>>>> Xingbo
> > >>>>>
> > >>>>> Wei Zhong <we...@gmail.com> 于2020年7月9日周四 上午10:35写道：
> > >>>>>
> > >>>>>> Hi Shuiqiang,
> > >>>>>>
> > >>>>>> Thanks for driving this. Big +1 for supporting DataStream API in
> > >>> PyFlink!
> > >>>>>>
> > >>>>>> Best,
> > >>>>>> Wei
> > >>>>>>
> > >>>>>>
> > >>>>>>> 在 2020年7月9日，10:29，Hequn Cheng <he...@apache.org> 写道：
> > >>>>>>>
> > >>>>>>> +1 for adding the Python DataStream API and starting with the
> > >>> stateless
> > >>>>>>> part.
> > >>>>>>> There are already some users that expressed their wish to have
> > >> the
> > >>>>> Python
> > >>>>>>> DataStream APIs. Once we have the APIs in PyFlink, we can cover
> > >>> more
> > >>>>> use
> > >>>>>>> cases for our users.
> > >>>>>>>
> > >>>>>>> Best, Hequn
> > >>>>>>>
> > >>>>>>> On Wed, Jul 8, 2020 at 11:45 AM Shuiqiang Chen <
> > >>> acqua.csq@gmail.com>
> > >>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> Sorry, the 3rd link is broken, please refer to this one: Support
> > >>>>> Python
> > >>>>>>>> DataStream API
> > >>>>>>>> <
> > >>>>>>>>
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> https://docs.google.com/document/d/1H3hz8wuk22-8cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Shuiqiang Chen <ac...@gmail.com> 于2020年7月8日周三 上午11:13写道：
> > >>>>>>>>
> > >>>>>>>>> Hi everyone,
> > >>>>>>>>>
> > >>>>>>>>> As we all know, Flink provides three layered APIs: the
> > >>>>>> ProcessFunctions,
> > >>>>>>>>> the DataStream API and the SQL & Table API. Each API offers a
> > >>>>> different
> > >>>>>>>>> trade-off between conciseness and expressiveness and targets
> > >>>>> different
> > >>>>>>>> use
> > >>>>>>>>> cases[1].
> > >>>>>>>>>
> > >>>>>>>>> Currently, the SQL & Table API has already been supported in
> > >>> PyFlink.
> > >>>>>> The
> > >>>>>>>>> API provides relational operations as well as user-defined
> > >>> functions
> > >>>>> to
> > >>>>>>>>> provide convenience for users who are familiar with python and
> > >>>>>> relational
> > >>>>>>>>> programming.
> > >>>>>>>>>
> > >>>>>>>>> Meanwhile, the DataStream API and ProcessFunctions provide more
> > >>>>> generic
> > >>>>>>>>> APIs to implement stream processing applications. The
> > >>>>> ProcessFunctions
> > >>>>>>>>> expose time and state which are the fundamental building blocks
> > >>> for
> > >>>>> any
> > >>>>>>>>> kind of streaming application.
> > >>>>>>>>> To cover more use cases, we are planning to cover all these
> > >> APIs
> > >>> in
> > >>>>>>>>> PyFlink.
> > >>>>>>>>>
> > >>>>>>>>> In this discussion(FLIP-130), we propose to support the Python
> > >>>>>> DataStream
> > >>>>>>>>> API for the stateless part. For more detail, please refer to
> > >> the
> > >>> FLIP
> > >>>>>>>> wiki
> > >>>>>>>>> page here[2]. If interested in the stateful part, you can also
> > >>> take a
> > >>>>>>>>> look the design doc here[3] for which we are going to discuss
> > >> in
> > >>> a
> > >>>>>>>> separate
> > >>>>>>>>> FLIP.
> > >>>>>>>>>
> > >>>>>>>>> Any comments will be highly appreciated!
> > >>>>>>>>>
> > >>>>>>>>> [1]
> > >>> https://flink.apache.org/flink-applications.html#layered-apis
> > >>>>>>>>> [2]
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866298
> > >>>>>>>>> [3]
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> https://docs.google.com/document/d/1H3hz8wuk228cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit?usp=sharing
> > >>>>>>>>>
> > >>>>>>>>> Best,
> > >>>>>>>>> Shuiqiang
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> > >
> >
> >
>

Re: [DISCUSS] FLIP-130: Support for Python DataStream API (Stateless Part)

Posted by Hequn Cheng <he...@apache.org>.

Thanks a lot for your valuable feedback and suggestions! @Aljoscha Krettek
<al...@apache.org>
+1 to the vote.

Best,
Hequn

On Fri, Jul 24, 2020 at 5:16 PM Aljoscha Krettek <al...@apache.org>
wrote:

> Thanks for updating! And yes, I think it's ok to include the few helper
> methods such as "readFromFile" and "print".
>
> I think we can now proceed to a vote! Nice work, overall!
>
> Best,
> Aljoscha
>
> On 16.07.20 17:16, Hequn Cheng wrote:
> > Hi,
> >
> > Thanks a lot for your discussions.
> > I think Aljoscha makes good suggestions here! Those problematic APIs
> should
> > not be added to the new Python DataStream API.
> >
> > Only one item I want to add based on the reply from Shuiqiang:
> > I would also tend to keep the readTextFile() method. Apart from print(),
> > the readTextFile() may also be very helpful and frequently used for
> playing
> > with Flink.
> > For example, it is used in our WordCount example[1] which is almost the
> > first Flink program that every beginner runs.
> > It is more efficient for reading multi-line data compared to
> > fromCollection() meanwhile far more easier to be used compared to Kafka,
> > Kinesis, RabbitMQ,etc., in
> > cases for playing with Flink.
> >
> > What do you think?
> >
> > Best,
> > Hequn
> >
> > [1]
> >
> https://github.com/apache/flink/blob/master/flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/wordcount/WordCount.java
> >
> >
> > On Thu, Jul 16, 2020 at 3:37 PM Shuiqiang Chen <ac...@gmail.com>
> wrote:
> >
> >> Hi Aljoscha,
> >>
> >> Thank you for your valuable comments! I agree with you that there is
> some
> >> optimization space for existing API and can be applied to the python
> >> DataStream API implementation.
> >>
> >> According to your comments, I have concluded them into the following
> parts:
> >>
> >> 1. SingleOutputStreamOperator and DataStreamSource.
> >> Yes, the SingleOutputStreamOperator and DataStreamSource are a bit
> >> redundant, so we can unify their APIs into DataStream to make it more
> >> clear.
> >>
> >> 2. The internal or low-level methods.
> >>   - DataStream.get_id(): Has been removed in the FLIP wiki page.
> >>   - DataStream.partition_custom(): Has been removed in the FLIP wiki
> page.
> >>   - SingleOutputStreamOperator.can_be_parallel/forceNoParallel: Has been
> >> removed in the FLIP wiki page.
> >> Sorry for mistakenly making those internal methods public, we would not
> >> expose them to users in the Python API.
> >>
> >> 3. "declarative" Apis.
> >> - KeyedStream.sum/min/max/min_by/max_by: Has been removed in the FLIP
> wiki
> >> page. They could be well covered by Table API.
> >>
> >> 4. Spelling problems.
> >> - StreamExecutionEnvironment.from_collections. Should be
> from_collection().
> >> - StreamExecutionEnvironment.generate_sequenece. Should be
> >> generate_sequence().
> >> Sorry for the spelling error.
> >>
> >> 5. Predefined source and sink.
> >> As you said, most of the predefined sources are not suitable for
> >> production, we can ignore them in the new Python DataStream API.
> >> There is one exception that maybe I think we should add the print()
> since
> >> it is commonly used by users and it is very useful for debugging jobs.
> We
> >> can add comments for the API that it should never be used for
> production.
> >> Meanwhile, as you mentioned, a good alternative that always prints on
> the
> >> client should also be supported. For this case, maybe we can add the
> >> collect method and return an Iterator. With the iterator, uses can print
> >> the content on the client. This is also consistent with the behavior in
> >> Table API.
> >>
> >> 6. For Row.
> >> Do you mean that we should not expose the Row type in Python API? Maybe
> I
> >> haven't gotten your concerns well.
> >> We can use tuple type in Python DataStream to support Row. (I have
> updated
> >> the example section of the FLIP to reflect the design.)
> >>
> >> Highly appreciated for your suggestions again. Looking forward to your
> >> feedback.
> >>
> >> Best,
> >> Shuiqiang
> >>
> >> Aljoscha Krettek <al...@apache.org> 于2020年7月15日周三 下午5:58写道：
> >>
> >>> Hi,
> >>>
> >>> thanks for the proposal! I have some comments about the API. We should
> >> not
> >>> blindly copy the existing Java DataSteam because we made some mistakes
> >> with
> >>> that and we now have a chance to fix them and not forward them to a new
> >> API.
> >>>
> >>> I don't think we need SingleOutputStreamOperator, in the Scala API we
> >> just
> >>> have DataStream and the relevant methods from
> SingleOutputStreamOperator
> >>> are added to DataStream. Having this extra type is more confusing than
> >>> helpful to users, I think. In the same vain, I think we also don't need
> >>> DataStreamSource. The source methods can also just return a DataStream.
> >>>
> >>> There are some methods that I would consider internal and we shouldn't
> >>> expose them:
> >>>   - DataStream.get_id(): this is an internal method
> >>>   - DataStream.partition_custom(): I think adding this method was a
> >> mistake
> >>> because it's to low-level, I could be convinced otherwise
> >>>   - DataStream.print()/DataStream.print_to_error(): These are
> questionable
> >>> because they print to the TaskManager log. Maybe we could add a good
> >>> alternative that always prints on the client, similar to the Table API
> >>>   - DataStream.write_to_socket(): It was a mistake to add this sink on
> >>> DataStream it is not fault-tolerant and shouldn't be used in production
> >>>
> >>>   - KeyedStream.sum/min/max/min_by/max_by: Nowadays, the Table API
> should
> >>> be used for "declarative" use cases and I think these methods should
> not
> >> be
> >>> in the DataStream API
> >>>   - SingleOutputStreamOperator.can_be_parallel/forceNoParallel: these
> are
> >>> internal methods
> >>>
> >>>   - StreamExecutionEnvironment.from_parallel_collection(): I think the
> >>> usability is questionable
> >>>   - StreamExecutionEnvironment.from_collections -> should be called
> >>> from_collection
> >>>   - StreamExecutionEnvironment.generate_sequenece -> should be called
> >>> generate_sequence
> >>>
> >>> I think most of the predefined sources are questionable:
> >>>   - fromParallelCollection: I don't know if this is useful
> >>>   - readTextFile: most of the variants are not useful/fault-tolerant
> >>>   - readFile: same
> >>>   - socketTextStream: also not useful except for toy examples
> >>>   - createInput: also not useful, and it's legacy DataSet InputFormats
> >>>
> >>> I think we need to think hard whether we want to further expose Row in
> >> our
> >>> APIs. I think adding it to flink-core was more an accident than
> anything
> >>> else but I can see that it would be useful for Python/Java interop.
> >>>
> >>> Best,
> >>> Aljoscha
> >>>
> >>>
> >>> On Mon, Jul 13, 2020, at 04:38, jincheng sun wrote:
> >>>> Thanks for bring up this DISCUSS Shuiqiang!
> >>>>
> >>>> +1 for the proposal!
> >>>>
> >>>> Best,
> >>>> Jincheng
> >>>>
> >>>>
> >>>> Xingbo Huang <hx...@gmail.com> 于2020年7月9日周四 上午10:41写道：
> >>>>
> >>>>> Hi Shuiqiang,
> >>>>>
> >>>>> Thanks a lot for driving this discussion.
> >>>>> Big +1 for supporting Python DataStream.
> >>>>> In many ML scenarios, operating Object will be more natural than
> >>> operating
> >>>>> Table.
> >>>>>
> >>>>> Best,
> >>>>> Xingbo
> >>>>>
> >>>>> Wei Zhong <we...@gmail.com> 于2020年7月9日周四 上午10:35写道：
> >>>>>
> >>>>>> Hi Shuiqiang,
> >>>>>>
> >>>>>> Thanks for driving this. Big +1 for supporting DataStream API in
> >>> PyFlink!
> >>>>>>
> >>>>>> Best,
> >>>>>> Wei
> >>>>>>
> >>>>>>
> >>>>>>> 在 2020年7月9日，10:29，Hequn Cheng <he...@apache.org> 写道：
> >>>>>>>
> >>>>>>> +1 for adding the Python DataStream API and starting with the
> >>> stateless
> >>>>>>> part.
> >>>>>>> There are already some users that expressed their wish to have
> >> the
> >>>>> Python
> >>>>>>> DataStream APIs. Once we have the APIs in PyFlink, we can cover
> >>> more
> >>>>> use
> >>>>>>> cases for our users.
> >>>>>>>
> >>>>>>> Best, Hequn
> >>>>>>>
> >>>>>>> On Wed, Jul 8, 2020 at 11:45 AM Shuiqiang Chen <
> >>> acqua.csq@gmail.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>>> Sorry, the 3rd link is broken, please refer to this one: Support
> >>>>> Python
> >>>>>>>> DataStream API
> >>>>>>>> <
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>
> >>
> https://docs.google.com/document/d/1H3hz8wuk22-8cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Shuiqiang Chen <ac...@gmail.com> 于2020年7月8日周三 上午11:13写道：
> >>>>>>>>
> >>>>>>>>> Hi everyone,
> >>>>>>>>>
> >>>>>>>>> As we all know, Flink provides three layered APIs: the
> >>>>>> ProcessFunctions,
> >>>>>>>>> the DataStream API and the SQL & Table API. Each API offers a
> >>>>> different
> >>>>>>>>> trade-off between conciseness and expressiveness and targets
> >>>>> different
> >>>>>>>> use
> >>>>>>>>> cases[1].
> >>>>>>>>>
> >>>>>>>>> Currently, the SQL & Table API has already been supported in
> >>> PyFlink.
> >>>>>> The
> >>>>>>>>> API provides relational operations as well as user-defined
> >>> functions
> >>>>> to
> >>>>>>>>> provide convenience for users who are familiar with python and
> >>>>>> relational
> >>>>>>>>> programming.
> >>>>>>>>>
> >>>>>>>>> Meanwhile, the DataStream API and ProcessFunctions provide more
> >>>>> generic
> >>>>>>>>> APIs to implement stream processing applications. The
> >>>>> ProcessFunctions
> >>>>>>>>> expose time and state which are the fundamental building blocks
> >>> for
> >>>>> any
> >>>>>>>>> kind of streaming application.
> >>>>>>>>> To cover more use cases, we are planning to cover all these
> >> APIs
> >>> in
> >>>>>>>>> PyFlink.
> >>>>>>>>>
> >>>>>>>>> In this discussion(FLIP-130), we propose to support the Python
> >>>>>> DataStream
> >>>>>>>>> API for the stateless part. For more detail, please refer to
> >> the
> >>> FLIP
> >>>>>>>> wiki
> >>>>>>>>> page here[2]. If interested in the stateful part, you can also
> >>> take a
> >>>>>>>>> look the design doc here[3] for which we are going to discuss
> >> in
> >>> a
> >>>>>>>> separate
> >>>>>>>>> FLIP.
> >>>>>>>>>
> >>>>>>>>> Any comments will be highly appreciated!
> >>>>>>>>>
> >>>>>>>>> [1]
> >>> https://flink.apache.org/flink-applications.html#layered-apis
> >>>>>>>>> [2]
> >>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866298
> >>>>>>>>> [3]
> >>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>
> >>
> https://docs.google.com/document/d/1H3hz8wuk228cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit?usp=sharing
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> Shuiqiang
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>
>

Re: [DISCUSS] FLIP-130: Support for Python DataStream API (Stateless Part)

Posted by Aljoscha Krettek <al...@apache.org>.

Thanks for updating! And yes, I think it's ok to include the few helper 
methods such as "readFromFile" and "print".

I think we can now proceed to a vote! Nice work, overall!

Best,
Aljoscha

On 16.07.20 17:16, Hequn Cheng wrote:
> Hi,
> 
> Thanks a lot for your discussions.
> I think Aljoscha makes good suggestions here! Those problematic APIs should
> not be added to the new Python DataStream API.
> 
> Only one item I want to add based on the reply from Shuiqiang:
> I would also tend to keep the readTextFile() method. Apart from print(),
> the readTextFile() may also be very helpful and frequently used for playing
> with Flink.
> For example, it is used in our WordCount example[1] which is almost the
> first Flink program that every beginner runs.
> It is more efficient for reading multi-line data compared to
> fromCollection() meanwhile far more easier to be used compared to Kafka,
> Kinesis, RabbitMQ,etc., in
> cases for playing with Flink.
> 
> What do you think?
> 
> Best,
> Hequn
> 
> [1]
> https://github.com/apache/flink/blob/master/flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/wordcount/WordCount.java
> 
> 
> On Thu, Jul 16, 2020 at 3:37 PM Shuiqiang Chen <ac...@gmail.com> wrote:
> 
>> Hi Aljoscha,
>>
>> Thank you for your valuable comments! I agree with you that there is some
>> optimization space for existing API and can be applied to the python
>> DataStream API implementation.
>>
>> According to your comments, I have concluded them into the following parts:
>>
>> 1. SingleOutputStreamOperator and DataStreamSource.
>> Yes, the SingleOutputStreamOperator and DataStreamSource are a bit
>> redundant, so we can unify their APIs into DataStream to make it more
>> clear.
>>
>> 2. The internal or low-level methods.
>>   - DataStream.get_id(): Has been removed in the FLIP wiki page.
>>   - DataStream.partition_custom(): Has been removed in the FLIP wiki page.
>>   - SingleOutputStreamOperator.can_be_parallel/forceNoParallel: Has been
>> removed in the FLIP wiki page.
>> Sorry for mistakenly making those internal methods public, we would not
>> expose them to users in the Python API.
>>
>> 3. "declarative" Apis.
>> - KeyedStream.sum/min/max/min_by/max_by: Has been removed in the FLIP wiki
>> page. They could be well covered by Table API.
>>
>> 4. Spelling problems.
>> - StreamExecutionEnvironment.from_collections. Should be from_collection().
>> - StreamExecutionEnvironment.generate_sequenece. Should be
>> generate_sequence().
>> Sorry for the spelling error.
>>
>> 5. Predefined source and sink.
>> As you said, most of the predefined sources are not suitable for
>> production, we can ignore them in the new Python DataStream API.
>> There is one exception that maybe I think we should add the print() since
>> it is commonly used by users and it is very useful for debugging jobs. We
>> can add comments for the API that it should never be used for production.
>> Meanwhile, as you mentioned, a good alternative that always prints on the
>> client should also be supported. For this case, maybe we can add the
>> collect method and return an Iterator. With the iterator, uses can print
>> the content on the client. This is also consistent with the behavior in
>> Table API.
>>
>> 6. For Row.
>> Do you mean that we should not expose the Row type in Python API? Maybe I
>> haven't gotten your concerns well.
>> We can use tuple type in Python DataStream to support Row. (I have updated
>> the example section of the FLIP to reflect the design.)
>>
>> Highly appreciated for your suggestions again. Looking forward to your
>> feedback.
>>
>> Best,
>> Shuiqiang
>>
>> Aljoscha Krettek <al...@apache.org> 于2020年7月15日周三 下午5:58写道：
>>
>>> Hi,
>>>
>>> thanks for the proposal! I have some comments about the API. We should
>> not
>>> blindly copy the existing Java DataSteam because we made some mistakes
>> with
>>> that and we now have a chance to fix them and not forward them to a new
>> API.
>>>
>>> I don't think we need SingleOutputStreamOperator, in the Scala API we
>> just
>>> have DataStream and the relevant methods from SingleOutputStreamOperator
>>> are added to DataStream. Having this extra type is more confusing than
>>> helpful to users, I think. In the same vain, I think we also don't need
>>> DataStreamSource. The source methods can also just return a DataStream.
>>>
>>> There are some methods that I would consider internal and we shouldn't
>>> expose them:
>>>   - DataStream.get_id(): this is an internal method
>>>   - DataStream.partition_custom(): I think adding this method was a
>> mistake
>>> because it's to low-level, I could be convinced otherwise
>>>   - DataStream.print()/DataStream.print_to_error(): These are questionable
>>> because they print to the TaskManager log. Maybe we could add a good
>>> alternative that always prints on the client, similar to the Table API
>>>   - DataStream.write_to_socket(): It was a mistake to add this sink on
>>> DataStream it is not fault-tolerant and shouldn't be used in production
>>>
>>>   - KeyedStream.sum/min/max/min_by/max_by: Nowadays, the Table API should
>>> be used for "declarative" use cases and I think these methods should not
>> be
>>> in the DataStream API
>>>   - SingleOutputStreamOperator.can_be_parallel/forceNoParallel: these are
>>> internal methods
>>>
>>>   - StreamExecutionEnvironment.from_parallel_collection(): I think the
>>> usability is questionable
>>>   - StreamExecutionEnvironment.from_collections -> should be called
>>> from_collection
>>>   - StreamExecutionEnvironment.generate_sequenece -> should be called
>>> generate_sequence
>>>
>>> I think most of the predefined sources are questionable:
>>>   - fromParallelCollection: I don't know if this is useful
>>>   - readTextFile: most of the variants are not useful/fault-tolerant
>>>   - readFile: same
>>>   - socketTextStream: also not useful except for toy examples
>>>   - createInput: also not useful, and it's legacy DataSet InputFormats
>>>
>>> I think we need to think hard whether we want to further expose Row in
>> our
>>> APIs. I think adding it to flink-core was more an accident than anything
>>> else but I can see that it would be useful for Python/Java interop.
>>>
>>> Best,
>>> Aljoscha
>>>
>>>
>>> On Mon, Jul 13, 2020, at 04:38, jincheng sun wrote:
>>>> Thanks for bring up this DISCUSS Shuiqiang!
>>>>
>>>> +1 for the proposal!
>>>>
>>>> Best,
>>>> Jincheng
>>>>
>>>>
>>>> Xingbo Huang <hx...@gmail.com> 于2020年7月9日周四 上午10:41写道：
>>>>
>>>>> Hi Shuiqiang,
>>>>>
>>>>> Thanks a lot for driving this discussion.
>>>>> Big +1 for supporting Python DataStream.
>>>>> In many ML scenarios, operating Object will be more natural than
>>> operating
>>>>> Table.
>>>>>
>>>>> Best,
>>>>> Xingbo
>>>>>
>>>>> Wei Zhong <we...@gmail.com> 于2020年7月9日周四 上午10:35写道：
>>>>>
>>>>>> Hi Shuiqiang,
>>>>>>
>>>>>> Thanks for driving this. Big +1 for supporting DataStream API in
>>> PyFlink!
>>>>>>
>>>>>> Best,
>>>>>> Wei
>>>>>>
>>>>>>
>>>>>>> 在 2020年7月9日，10:29，Hequn Cheng <he...@apache.org> 写道：
>>>>>>>
>>>>>>> +1 for adding the Python DataStream API and starting with the
>>> stateless
>>>>>>> part.
>>>>>>> There are already some users that expressed their wish to have
>> the
>>>>> Python
>>>>>>> DataStream APIs. Once we have the APIs in PyFlink, we can cover
>>> more
>>>>> use
>>>>>>> cases for our users.
>>>>>>>
>>>>>>> Best, Hequn
>>>>>>>
>>>>>>> On Wed, Jul 8, 2020 at 11:45 AM Shuiqiang Chen <
>>> acqua.csq@gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>>> Sorry, the 3rd link is broken, please refer to this one: Support
>>>>> Python
>>>>>>>> DataStream API
>>>>>>>> <
>>>>>>>>
>>>>>>
>>>>>
>>>
>> https://docs.google.com/document/d/1H3hz8wuk22-8cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit
>>>>>>>>>
>>>>>>>>
>>>>>>>> Shuiqiang Chen <ac...@gmail.com> 于2020年7月8日周三 上午11:13写道：
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> As we all know, Flink provides three layered APIs: the
>>>>>> ProcessFunctions,
>>>>>>>>> the DataStream API and the SQL & Table API. Each API offers a
>>>>> different
>>>>>>>>> trade-off between conciseness and expressiveness and targets
>>>>> different
>>>>>>>> use
>>>>>>>>> cases[1].
>>>>>>>>>
>>>>>>>>> Currently, the SQL & Table API has already been supported in
>>> PyFlink.
>>>>>> The
>>>>>>>>> API provides relational operations as well as user-defined
>>> functions
>>>>> to
>>>>>>>>> provide convenience for users who are familiar with python and
>>>>>> relational
>>>>>>>>> programming.
>>>>>>>>>
>>>>>>>>> Meanwhile, the DataStream API and ProcessFunctions provide more
>>>>> generic
>>>>>>>>> APIs to implement stream processing applications. The
>>>>> ProcessFunctions
>>>>>>>>> expose time and state which are the fundamental building blocks
>>> for
>>>>> any
>>>>>>>>> kind of streaming application.
>>>>>>>>> To cover more use cases, we are planning to cover all these
>> APIs
>>> in
>>>>>>>>> PyFlink.
>>>>>>>>>
>>>>>>>>> In this discussion(FLIP-130), we propose to support the Python
>>>>>> DataStream
>>>>>>>>> API for the stateless part. For more detail, please refer to
>> the
>>> FLIP
>>>>>>>> wiki
>>>>>>>>> page here[2]. If interested in the stateful part, you can also
>>> take a
>>>>>>>>> look the design doc here[3] for which we are going to discuss
>> in
>>> a
>>>>>>>> separate
>>>>>>>>> FLIP.
>>>>>>>>>
>>>>>>>>> Any comments will be highly appreciated!
>>>>>>>>>
>>>>>>>>> [1]
>>> https://flink.apache.org/flink-applications.html#layered-apis
>>>>>>>>> [2]
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866298
>>>>>>>>> [3]
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>
>> https://docs.google.com/document/d/1H3hz8wuk228cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit?usp=sharing
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Shuiqiang
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: [DISCUSS] FLIP-130: Support for Python DataStream API (Stateless Part)

Posted by Hequn Cheng <he...@apache.org>.

Hi,

Thanks a lot for your discussions.
I think Aljoscha makes good suggestions here! Those problematic APIs should
not be added to the new Python DataStream API.

Only one item I want to add based on the reply from Shuiqiang:
I would also tend to keep the readTextFile() method. Apart from print(),
the readTextFile() may also be very helpful and frequently used for playing
with Flink.
For example, it is used in our WordCount example[1] which is almost the
first Flink program that every beginner runs.
It is more efficient for reading multi-line data compared to
fromCollection() meanwhile far more easier to be used compared to Kafka,
Kinesis, RabbitMQ,etc., in
cases for playing with Flink.

What do you think?

Best,
Hequn

[1]
https://github.com/apache/flink/blob/master/flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/wordcount/WordCount.java


On Thu, Jul 16, 2020 at 3:37 PM Shuiqiang Chen <ac...@gmail.com> wrote:

> Hi Aljoscha,
>
> Thank you for your valuable comments! I agree with you that there is some
> optimization space for existing API and can be applied to the python
> DataStream API implementation.
>
> According to your comments, I have concluded them into the following parts:
>
> 1. SingleOutputStreamOperator and DataStreamSource.
> Yes, the SingleOutputStreamOperator and DataStreamSource are a bit
> redundant, so we can unify their APIs into DataStream to make it more
> clear.
>
> 2. The internal or low-level methods.
>  - DataStream.get_id(): Has been removed in the FLIP wiki page.
>  - DataStream.partition_custom(): Has been removed in the FLIP wiki page.
>  - SingleOutputStreamOperator.can_be_parallel/forceNoParallel: Has been
> removed in the FLIP wiki page.
> Sorry for mistakenly making those internal methods public, we would not
> expose them to users in the Python API.
>
> 3. "declarative" Apis.
> - KeyedStream.sum/min/max/min_by/max_by: Has been removed in the FLIP wiki
> page. They could be well covered by Table API.
>
> 4. Spelling problems.
> - StreamExecutionEnvironment.from_collections. Should be from_collection().
> - StreamExecutionEnvironment.generate_sequenece. Should be
> generate_sequence().
> Sorry for the spelling error.
>
> 5. Predefined source and sink.
> As you said, most of the predefined sources are not suitable for
> production, we can ignore them in the new Python DataStream API.
> There is one exception that maybe I think we should add the print() since
> it is commonly used by users and it is very useful for debugging jobs. We
> can add comments for the API that it should never be used for production.
> Meanwhile, as you mentioned, a good alternative that always prints on the
> client should also be supported. For this case, maybe we can add the
> collect method and return an Iterator. With the iterator, uses can print
> the content on the client. This is also consistent with the behavior in
> Table API.
>
> 6. For Row.
> Do you mean that we should not expose the Row type in Python API? Maybe I
> haven't gotten your concerns well.
> We can use tuple type in Python DataStream to support Row. (I have updated
> the example section of the FLIP to reflect the design.)
>
> Highly appreciated for your suggestions again. Looking forward to your
> feedback.
>
> Best,
> Shuiqiang
>
> Aljoscha Krettek <al...@apache.org> 于2020年7月15日周三 下午5:58写道：
>
> > Hi,
> >
> > thanks for the proposal! I have some comments about the API. We should
> not
> > blindly copy the existing Java DataSteam because we made some mistakes
> with
> > that and we now have a chance to fix them and not forward them to a new
> API.
> >
> > I don't think we need SingleOutputStreamOperator, in the Scala API we
> just
> > have DataStream and the relevant methods from SingleOutputStreamOperator
> > are added to DataStream. Having this extra type is more confusing than
> > helpful to users, I think. In the same vain, I think we also don't need
> > DataStreamSource. The source methods can also just return a DataStream.
> >
> > There are some methods that I would consider internal and we shouldn't
> > expose them:
> >  - DataStream.get_id(): this is an internal method
> >  - DataStream.partition_custom(): I think adding this method was a
> mistake
> > because it's to low-level, I could be convinced otherwise
> >  - DataStream.print()/DataStream.print_to_error(): These are questionable
> > because they print to the TaskManager log. Maybe we could add a good
> > alternative that always prints on the client, similar to the Table API
> >  - DataStream.write_to_socket(): It was a mistake to add this sink on
> > DataStream it is not fault-tolerant and shouldn't be used in production
> >
> >  - KeyedStream.sum/min/max/min_by/max_by: Nowadays, the Table API should
> > be used for "declarative" use cases and I think these methods should not
> be
> > in the DataStream API
> >  - SingleOutputStreamOperator.can_be_parallel/forceNoParallel: these are
> > internal methods
> >
> >  - StreamExecutionEnvironment.from_parallel_collection(): I think the
> > usability is questionable
> >  - StreamExecutionEnvironment.from_collections -> should be called
> > from_collection
> >  - StreamExecutionEnvironment.generate_sequenece -> should be called
> > generate_sequence
> >
> > I think most of the predefined sources are questionable:
> >  - fromParallelCollection: I don't know if this is useful
> >  - readTextFile: most of the variants are not useful/fault-tolerant
> >  - readFile: same
> >  - socketTextStream: also not useful except for toy examples
> >  - createInput: also not useful, and it's legacy DataSet InputFormats
> >
> > I think we need to think hard whether we want to further expose Row in
> our
> > APIs. I think adding it to flink-core was more an accident than anything
> > else but I can see that it would be useful for Python/Java interop.
> >
> > Best,
> > Aljoscha
> >
> >
> > On Mon, Jul 13, 2020, at 04:38, jincheng sun wrote:
> > > Thanks for bring up this DISCUSS Shuiqiang!
> > >
> > > +1 for the proposal!
> > >
> > > Best,
> > > Jincheng
> > >
> > >
> > > Xingbo Huang <hx...@gmail.com> 于2020年7月9日周四 上午10:41写道：
> > >
> > > > Hi Shuiqiang,
> > > >
> > > > Thanks a lot for driving this discussion.
> > > > Big +1 for supporting Python DataStream.
> > > > In many ML scenarios, operating Object will be more natural than
> > operating
> > > > Table.
> > > >
> > > > Best,
> > > > Xingbo
> > > >
> > > > Wei Zhong <we...@gmail.com> 于2020年7月9日周四 上午10:35写道：
> > > >
> > > > > Hi Shuiqiang,
> > > > >
> > > > > Thanks for driving this. Big +1 for supporting DataStream API in
> > PyFlink!
> > > > >
> > > > > Best,
> > > > > Wei
> > > > >
> > > > >
> > > > > > 在 2020年7月9日，10:29，Hequn Cheng <he...@apache.org> 写道：
> > > > > >
> > > > > > +1 for adding the Python DataStream API and starting with the
> > stateless
> > > > > > part.
> > > > > > There are already some users that expressed their wish to have
> the
> > > > Python
> > > > > > DataStream APIs. Once we have the APIs in PyFlink, we can cover
> > more
> > > > use
> > > > > > cases for our users.
> > > > > >
> > > > > > Best, Hequn
> > > > > >
> > > > > > On Wed, Jul 8, 2020 at 11:45 AM Shuiqiang Chen <
> > acqua.csq@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > >> Sorry, the 3rd link is broken, please refer to this one: Support
> > > > Python
> > > > > >> DataStream API
> > > > > >> <
> > > > > >>
> > > > >
> > > >
> >
> https://docs.google.com/document/d/1H3hz8wuk22-8cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit
> > > > > >>>
> > > > > >>
> > > > > >> Shuiqiang Chen <ac...@gmail.com> 于2020年7月8日周三 上午11:13写道：
> > > > > >>
> > > > > >>> Hi everyone,
> > > > > >>>
> > > > > >>> As we all know, Flink provides three layered APIs: the
> > > > > ProcessFunctions,
> > > > > >>> the DataStream API and the SQL & Table API. Each API offers a
> > > > different
> > > > > >>> trade-off between conciseness and expressiveness and targets
> > > > different
> > > > > >> use
> > > > > >>> cases[1].
> > > > > >>>
> > > > > >>> Currently, the SQL & Table API has already been supported in
> > PyFlink.
> > > > > The
> > > > > >>> API provides relational operations as well as user-defined
> > functions
> > > > to
> > > > > >>> provide convenience for users who are familiar with python and
> > > > > relational
> > > > > >>> programming.
> > > > > >>>
> > > > > >>> Meanwhile, the DataStream API and ProcessFunctions provide more
> > > > generic
> > > > > >>> APIs to implement stream processing applications. The
> > > > ProcessFunctions
> > > > > >>> expose time and state which are the fundamental building blocks
> > for
> > > > any
> > > > > >>> kind of streaming application.
> > > > > >>> To cover more use cases, we are planning to cover all these
> APIs
> > in
> > > > > >>> PyFlink.
> > > > > >>>
> > > > > >>> In this discussion(FLIP-130), we propose to support the Python
> > > > > DataStream
> > > > > >>> API for the stateless part. For more detail, please refer to
> the
> > FLIP
> > > > > >> wiki
> > > > > >>> page here[2]. If interested in the stateful part, you can also
> > take a
> > > > > >>> look the design doc here[3] for which we are going to discuss
> in
> > a
> > > > > >> separate
> > > > > >>> FLIP.
> > > > > >>>
> > > > > >>> Any comments will be highly appreciated!
> > > > > >>>
> > > > > >>> [1]
> > https://flink.apache.org/flink-applications.html#layered-apis
> > > > > >>> [2]
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866298
> > > > > >>> [3]
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> >
> https://docs.google.com/document/d/1H3hz8wuk228cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit?usp=sharing
> > > > > >>>
> > > > > >>> Best,
> > > > > >>> Shuiqiang
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-130: Support for Python DataStream API (Stateless Part)

Posted by Shuiqiang Chen <ac...@gmail.com>.

Hi Aljoscha,

Thank you for your valuable comments! I agree with you that there is some
optimization space for existing API and can be applied to the python
DataStream API implementation.

According to your comments, I have concluded them into the following parts:

1. SingleOutputStreamOperator and DataStreamSource.
Yes, the SingleOutputStreamOperator and DataStreamSource are a bit
redundant, so we can unify their APIs into DataStream to make it more clear.

2. The internal or low-level methods.
 - DataStream.get_id(): Has been removed in the FLIP wiki page.
 - DataStream.partition_custom(): Has been removed in the FLIP wiki page.
 - SingleOutputStreamOperator.can_be_parallel/forceNoParallel: Has been
removed in the FLIP wiki page.
Sorry for mistakenly making those internal methods public, we would not
expose them to users in the Python API.

3. "declarative" Apis.
- KeyedStream.sum/min/max/min_by/max_by: Has been removed in the FLIP wiki
page. They could be well covered by Table API.

4. Spelling problems.
- StreamExecutionEnvironment.from_collections. Should be from_collection().
- StreamExecutionEnvironment.generate_sequenece. Should be
generate_sequence().
Sorry for the spelling error.

5. Predefined source and sink.
As you said, most of the predefined sources are not suitable for
production, we can ignore them in the new Python DataStream API.
There is one exception that maybe I think we should add the print() since
it is commonly used by users and it is very useful for debugging jobs. We
can add comments for the API that it should never be used for production.
Meanwhile, as you mentioned, a good alternative that always prints on the
client should also be supported. For this case, maybe we can add the
collect method and return an Iterator. With the iterator, uses can print
the content on the client. This is also consistent with the behavior in
Table API.

6. For Row.
Do you mean that we should not expose the Row type in Python API? Maybe I
haven't gotten your concerns well.
We can use tuple type in Python DataStream to support Row. (I have updated
the example section of the FLIP to reflect the design.)

Highly appreciated for your suggestions again. Looking forward to your
feedback.

Best,
Shuiqiang

Aljoscha Krettek <al...@apache.org> 于2020年7月15日周三 下午5:58写道：

> Hi,
>
> thanks for the proposal! I have some comments about the API. We should not
> blindly copy the existing Java DataSteam because we made some mistakes with
> that and we now have a chance to fix them and not forward them to a new API.
>
> I don't think we need SingleOutputStreamOperator, in the Scala API we just
> have DataStream and the relevant methods from SingleOutputStreamOperator
> are added to DataStream. Having this extra type is more confusing than
> helpful to users, I think. In the same vain, I think we also don't need
> DataStreamSource. The source methods can also just return a DataStream.
>
> There are some methods that I would consider internal and we shouldn't
> expose them:
>  - DataStream.get_id(): this is an internal method
>  - DataStream.partition_custom(): I think adding this method was a mistake
> because it's to low-level, I could be convinced otherwise
>  - DataStream.print()/DataStream.print_to_error(): These are questionable
> because they print to the TaskManager log. Maybe we could add a good
> alternative that always prints on the client, similar to the Table API
>  - DataStream.write_to_socket(): It was a mistake to add this sink on
> DataStream it is not fault-tolerant and shouldn't be used in production
>
>  - KeyedStream.sum/min/max/min_by/max_by: Nowadays, the Table API should
> be used for "declarative" use cases and I think these methods should not be
> in the DataStream API
>  - SingleOutputStreamOperator.can_be_parallel/forceNoParallel: these are
> internal methods
>
>  - StreamExecutionEnvironment.from_parallel_collection(): I think the
> usability is questionable
>  - StreamExecutionEnvironment.from_collections -> should be called
> from_collection
>  - StreamExecutionEnvironment.generate_sequenece -> should be called
> generate_sequence
>
> I think most of the predefined sources are questionable:
>  - fromParallelCollection: I don't know if this is useful
>  - readTextFile: most of the variants are not useful/fault-tolerant
>  - readFile: same
>  - socketTextStream: also not useful except for toy examples
>  - createInput: also not useful, and it's legacy DataSet InputFormats
>
> I think we need to think hard whether we want to further expose Row in our
> APIs. I think adding it to flink-core was more an accident than anything
> else but I can see that it would be useful for Python/Java interop.
>
> Best,
> Aljoscha
>
>
> On Mon, Jul 13, 2020, at 04:38, jincheng sun wrote:
> > Thanks for bring up this DISCUSS Shuiqiang!
> >
> > +1 for the proposal!
> >
> > Best,
> > Jincheng
> >
> >
> > Xingbo Huang <hx...@gmail.com> 于2020年7月9日周四 上午10:41写道：
> >
> > > Hi Shuiqiang,
> > >
> > > Thanks a lot for driving this discussion.
> > > Big +1 for supporting Python DataStream.
> > > In many ML scenarios, operating Object will be more natural than
> operating
> > > Table.
> > >
> > > Best,
> > > Xingbo
> > >
> > > Wei Zhong <we...@gmail.com> 于2020年7月9日周四 上午10:35写道：
> > >
> > > > Hi Shuiqiang,
> > > >
> > > > Thanks for driving this. Big +1 for supporting DataStream API in
> PyFlink!
> > > >
> > > > Best,
> > > > Wei
> > > >
> > > >
> > > > > 在 2020年7月9日，10:29，Hequn Cheng <he...@apache.org> 写道：
> > > > >
> > > > > +1 for adding the Python DataStream API and starting with the
> stateless
> > > > > part.
> > > > > There are already some users that expressed their wish to have the
> > > Python
> > > > > DataStream APIs. Once we have the APIs in PyFlink, we can cover
> more
> > > use
> > > > > cases for our users.
> > > > >
> > > > > Best, Hequn
> > > > >
> > > > > On Wed, Jul 8, 2020 at 11:45 AM Shuiqiang Chen <
> acqua.csq@gmail.com>
> > > > wrote:
> > > > >
> > > > >> Sorry, the 3rd link is broken, please refer to this one: Support
> > > Python
> > > > >> DataStream API
> > > > >> <
> > > > >>
> > > >
> > >
> https://docs.google.com/document/d/1H3hz8wuk22-8cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit
> > > > >>>
> > > > >>
> > > > >> Shuiqiang Chen <ac...@gmail.com> 于2020年7月8日周三 上午11:13写道：
> > > > >>
> > > > >>> Hi everyone,
> > > > >>>
> > > > >>> As we all know, Flink provides three layered APIs: the
> > > > ProcessFunctions,
> > > > >>> the DataStream API and the SQL & Table API. Each API offers a
> > > different
> > > > >>> trade-off between conciseness and expressiveness and targets
> > > different
> > > > >> use
> > > > >>> cases[1].
> > > > >>>
> > > > >>> Currently, the SQL & Table API has already been supported in
> PyFlink.
> > > > The
> > > > >>> API provides relational operations as well as user-defined
> functions
> > > to
> > > > >>> provide convenience for users who are familiar with python and
> > > > relational
> > > > >>> programming.
> > > > >>>
> > > > >>> Meanwhile, the DataStream API and ProcessFunctions provide more
> > > generic
> > > > >>> APIs to implement stream processing applications. The
> > > ProcessFunctions
> > > > >>> expose time and state which are the fundamental building blocks
> for
> > > any
> > > > >>> kind of streaming application.
> > > > >>> To cover more use cases, we are planning to cover all these APIs
> in
> > > > >>> PyFlink.
> > > > >>>
> > > > >>> In this discussion(FLIP-130), we propose to support the Python
> > > > DataStream
> > > > >>> API for the stateless part. For more detail, please refer to the
> FLIP
> > > > >> wiki
> > > > >>> page here[2]. If interested in the stateful part, you can also
> take a
> > > > >>> look the design doc here[3] for which we are going to discuss in
> a
> > > > >> separate
> > > > >>> FLIP.
> > > > >>>
> > > > >>> Any comments will be highly appreciated!
> > > > >>>
> > > > >>> [1]
> https://flink.apache.org/flink-applications.html#layered-apis
> > > > >>> [2]
> > > > >>>
> > > > >>
> > > >
> > >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866298
> > > > >>> [3]
> > > > >>>
> > > > >>
> > > >
> > >
> https://docs.google.com/document/d/1H3hz8wuk228cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit?usp=sharing
> > > > >>>
> > > > >>> Best,
> > > > >>> Shuiqiang
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>
> > > >
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-130: Support for Python DataStream API (Stateless Part)

Posted by Aljoscha Krettek <al...@apache.org>.

Hi,

thanks for the proposal! I have some comments about the API. We should not blindly copy the existing Java DataSteam because we made some mistakes with that and we now have a chance to fix them and not forward them to a new API.

I don't think we need SingleOutputStreamOperator, in the Scala API we just have DataStream and the relevant methods from SingleOutputStreamOperator are added to DataStream. Having this extra type is more confusing than helpful to users, I think. In the same vain, I think we also don't need DataStreamSource. The source methods can also just return a DataStream.

There are some methods that I would consider internal and we shouldn't expose them:
 - DataStream.get_id(): this is an internal method
 - DataStream.partition_custom(): I think adding this method was a mistake because it's to low-level, I could be convinced otherwise
 - DataStream.print()/DataStream.print_to_error(): These are questionable because they print to the TaskManager log. Maybe we could add a good alternative that always prints on the client, similar to the Table API
 - DataStream.write_to_socket(): It was a mistake to add this sink on DataStream it is not fault-tolerant and shouldn't be used in production

 - KeyedStream.sum/min/max/min_by/max_by: Nowadays, the Table API should be used for "declarative" use cases and I think these methods should not be in the DataStream API
 - SingleOutputStreamOperator.can_be_parallel/forceNoParallel: these are internal methods

 - StreamExecutionEnvironment.from_parallel_collection(): I think the usability is questionable
 - StreamExecutionEnvironment.from_collections -> should be called from_collection
 - StreamExecutionEnvironment.generate_sequenece -> should be called generate_sequence

I think most of the predefined sources are questionable:
 - fromParallelCollection: I don't know if this is useful
 - readTextFile: most of the variants are not useful/fault-tolerant
 - readFile: same
 - socketTextStream: also not useful except for toy examples
 - createInput: also not useful, and it's legacy DataSet InputFormats

I think we need to think hard whether we want to further expose Row in our APIs. I think adding it to flink-core was more an accident than anything else but I can see that it would be useful for Python/Java interop.

Best,
Aljoscha


On Mon, Jul 13, 2020, at 04:38, jincheng sun wrote:
> Thanks for bring up this DISCUSS Shuiqiang!
> 
> +1 for the proposal!
> 
> Best,
> Jincheng
> 
> 
> Xingbo Huang <hx...@gmail.com> 于2020年7月9日周四 上午10:41写道：
> 
> > Hi Shuiqiang,
> >
> > Thanks a lot for driving this discussion.
> > Big +1 for supporting Python DataStream.
> > In many ML scenarios, operating Object will be more natural than operating
> > Table.
> >
> > Best,
> > Xingbo
> >
> > Wei Zhong <we...@gmail.com> 于2020年7月9日周四 上午10:35写道：
> >
> > > Hi Shuiqiang,
> > >
> > > Thanks for driving this. Big +1 for supporting DataStream API in PyFlink!
> > >
> > > Best,
> > > Wei
> > >
> > >
> > > > 在 2020年7月9日，10:29，Hequn Cheng <he...@apache.org> 写道：
> > > >
> > > > +1 for adding the Python DataStream API and starting with the stateless
> > > > part.
> > > > There are already some users that expressed their wish to have the
> > Python
> > > > DataStream APIs. Once we have the APIs in PyFlink, we can cover more
> > use
> > > > cases for our users.
> > > >
> > > > Best, Hequn
> > > >
> > > > On Wed, Jul 8, 2020 at 11:45 AM Shuiqiang Chen <ac...@gmail.com>
> > > wrote:
> > > >
> > > >> Sorry, the 3rd link is broken, please refer to this one: Support
> > Python
> > > >> DataStream API
> > > >> <
> > > >>
> > >
> > https://docs.google.com/document/d/1H3hz8wuk22-8cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit
> > > >>>
> > > >>
> > > >> Shuiqiang Chen <ac...@gmail.com> 于2020年7月8日周三 上午11:13写道：
> > > >>
> > > >>> Hi everyone,
> > > >>>
> > > >>> As we all know, Flink provides three layered APIs: the
> > > ProcessFunctions,
> > > >>> the DataStream API and the SQL & Table API. Each API offers a
> > different
> > > >>> trade-off between conciseness and expressiveness and targets
> > different
> > > >> use
> > > >>> cases[1].
> > > >>>
> > > >>> Currently, the SQL & Table API has already been supported in PyFlink.
> > > The
> > > >>> API provides relational operations as well as user-defined functions
> > to
> > > >>> provide convenience for users who are familiar with python and
> > > relational
> > > >>> programming.
> > > >>>
> > > >>> Meanwhile, the DataStream API and ProcessFunctions provide more
> > generic
> > > >>> APIs to implement stream processing applications. The
> > ProcessFunctions
> > > >>> expose time and state which are the fundamental building blocks for
> > any
> > > >>> kind of streaming application.
> > > >>> To cover more use cases, we are planning to cover all these APIs in
> > > >>> PyFlink.
> > > >>>
> > > >>> In this discussion(FLIP-130), we propose to support the Python
> > > DataStream
> > > >>> API for the stateless part. For more detail, please refer to the FLIP
> > > >> wiki
> > > >>> page here[2]. If interested in the stateful part, you can also take a
> > > >>> look the design doc here[3] for which we are going to discuss in a
> > > >> separate
> > > >>> FLIP.
> > > >>>
> > > >>> Any comments will be highly appreciated!
> > > >>>
> > > >>> [1] https://flink.apache.org/flink-applications.html#layered-apis
> > > >>> [2]
> > > >>>
> > > >>
> > >
> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866298
> > > >>> [3]
> > > >>>
> > > >>
> > >
> > https://docs.google.com/document/d/1H3hz8wuk228cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit?usp=sharing
> > > >>>
> > > >>> Best,
> > > >>> Shuiqiang
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>
> > >
> > >
> >
>

Re: [DISCUSS] FLIP-130: Support for Python DataStream API (Stateless Part)

Posted by jincheng sun <su...@gmail.com>.

Thanks for bring up this DISCUSS Shuiqiang!

+1 for the proposal!

Best,
Jincheng


Xingbo Huang <hx...@gmail.com> 于2020年7月9日周四 上午10:41写道：

> Hi Shuiqiang,
>
> Thanks a lot for driving this discussion.
> Big +1 for supporting Python DataStream.
> In many ML scenarios, operating Object will be more natural than operating
> Table.
>
> Best,
> Xingbo
>
> Wei Zhong <we...@gmail.com> 于2020年7月9日周四 上午10:35写道：
>
> > Hi Shuiqiang,
> >
> > Thanks for driving this. Big +1 for supporting DataStream API in PyFlink!
> >
> > Best,
> > Wei
> >
> >
> > > 在 2020年7月9日，10:29，Hequn Cheng <he...@apache.org> 写道：
> > >
> > > +1 for adding the Python DataStream API and starting with the stateless
> > > part.
> > > There are already some users that expressed their wish to have the
> Python
> > > DataStream APIs. Once we have the APIs in PyFlink, we can cover more
> use
> > > cases for our users.
> > >
> > > Best, Hequn
> > >
> > > On Wed, Jul 8, 2020 at 11:45 AM Shuiqiang Chen <ac...@gmail.com>
> > wrote:
> > >
> > >> Sorry, the 3rd link is broken, please refer to this one: Support
> Python
> > >> DataStream API
> > >> <
> > >>
> >
> https://docs.google.com/document/d/1H3hz8wuk22-8cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit
> > >>>
> > >>
> > >> Shuiqiang Chen <ac...@gmail.com> 于2020年7月8日周三 上午11:13写道：
> > >>
> > >>> Hi everyone,
> > >>>
> > >>> As we all know, Flink provides three layered APIs: the
> > ProcessFunctions,
> > >>> the DataStream API and the SQL & Table API. Each API offers a
> different
> > >>> trade-off between conciseness and expressiveness and targets
> different
> > >> use
> > >>> cases[1].
> > >>>
> > >>> Currently, the SQL & Table API has already been supported in PyFlink.
> > The
> > >>> API provides relational operations as well as user-defined functions
> to
> > >>> provide convenience for users who are familiar with python and
> > relational
> > >>> programming.
> > >>>
> > >>> Meanwhile, the DataStream API and ProcessFunctions provide more
> generic
> > >>> APIs to implement stream processing applications. The
> ProcessFunctions
> > >>> expose time and state which are the fundamental building blocks for
> any
> > >>> kind of streaming application.
> > >>> To cover more use cases, we are planning to cover all these APIs in
> > >>> PyFlink.
> > >>>
> > >>> In this discussion(FLIP-130), we propose to support the Python
> > DataStream
> > >>> API for the stateless part. For more detail, please refer to the FLIP
> > >> wiki
> > >>> page here[2]. If interested in the stateful part, you can also take a
> > >>> look the design doc here[3] for which we are going to discuss in a
> > >> separate
> > >>> FLIP.
> > >>>
> > >>> Any comments will be highly appreciated!
> > >>>
> > >>> [1] https://flink.apache.org/flink-applications.html#layered-apis
> > >>> [2]
> > >>>
> > >>
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866298
> > >>> [3]
> > >>>
> > >>
> >
> https://docs.google.com/document/d/1H3hz8wuk228cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit?usp=sharing
> > >>>
> > >>> Best,
> > >>> Shuiqiang
> > >>>
> > >>>
> > >>>
> > >>>
> > >>
> >
> >
>

Re: [DISCUSS] FLIP-130: Support for Python DataStream API (Stateless Part)

Posted by Xingbo Huang <hx...@gmail.com>.

Hi Shuiqiang,

Thanks a lot for driving this discussion.
Big +1 for supporting Python DataStream.
In many ML scenarios, operating Object will be more natural than operating
Table.

Best,
Xingbo

Wei Zhong <we...@gmail.com> 于2020年7月9日周四 上午10:35写道：

> Hi Shuiqiang,
>
> Thanks for driving this. Big +1 for supporting DataStream API in PyFlink!
>
> Best,
> Wei
>
>
> > 在 2020年7月9日，10:29，Hequn Cheng <he...@apache.org> 写道：
> >
> > +1 for adding the Python DataStream API and starting with the stateless
> > part.
> > There are already some users that expressed their wish to have the Python
> > DataStream APIs. Once we have the APIs in PyFlink, we can cover more use
> > cases for our users.
> >
> > Best, Hequn
> >
> > On Wed, Jul 8, 2020 at 11:45 AM Shuiqiang Chen <ac...@gmail.com>
> wrote:
> >
> >> Sorry, the 3rd link is broken, please refer to this one: Support Python
> >> DataStream API
> >> <
> >>
> https://docs.google.com/document/d/1H3hz8wuk22-8cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit
> >>>
> >>
> >> Shuiqiang Chen <ac...@gmail.com> 于2020年7月8日周三 上午11:13写道：
> >>
> >>> Hi everyone,
> >>>
> >>> As we all know, Flink provides three layered APIs: the
> ProcessFunctions,
> >>> the DataStream API and the SQL & Table API. Each API offers a different
> >>> trade-off between conciseness and expressiveness and targets different
> >> use
> >>> cases[1].
> >>>
> >>> Currently, the SQL & Table API has already been supported in PyFlink.
> The
> >>> API provides relational operations as well as user-defined functions to
> >>> provide convenience for users who are familiar with python and
> relational
> >>> programming.
> >>>
> >>> Meanwhile, the DataStream API and ProcessFunctions provide more generic
> >>> APIs to implement stream processing applications. The ProcessFunctions
> >>> expose time and state which are the fundamental building blocks for any
> >>> kind of streaming application.
> >>> To cover more use cases, we are planning to cover all these APIs in
> >>> PyFlink.
> >>>
> >>> In this discussion(FLIP-130), we propose to support the Python
> DataStream
> >>> API for the stateless part. For more detail, please refer to the FLIP
> >> wiki
> >>> page here[2]. If interested in the stateful part, you can also take a
> >>> look the design doc here[3] for which we are going to discuss in a
> >> separate
> >>> FLIP.
> >>>
> >>> Any comments will be highly appreciated!
> >>>
> >>> [1] https://flink.apache.org/flink-applications.html#layered-apis
> >>> [2]
> >>>
> >>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866298
> >>> [3]
> >>>
> >>
> https://docs.google.com/document/d/1H3hz8wuk228cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit?usp=sharing
> >>>
> >>> Best,
> >>> Shuiqiang
> >>>
> >>>
> >>>
> >>>
> >>
>
>

Re: [DISCUSS] FLIP-130: Support for Python DataStream API (Stateless Part)

Posted by Wei Zhong <we...@gmail.com>.

Hi Shuiqiang,

Thanks for driving this. Big +1 for supporting DataStream API in PyFlink!

Best,
Wei


> 在 2020年7月9日，10:29，Hequn Cheng <he...@apache.org> 写道：
> 
> +1 for adding the Python DataStream API and starting with the stateless
> part.
> There are already some users that expressed their wish to have the Python
> DataStream APIs. Once we have the APIs in PyFlink, we can cover more use
> cases for our users.
> 
> Best, Hequn
> 
> On Wed, Jul 8, 2020 at 11:45 AM Shuiqiang Chen <ac...@gmail.com> wrote:
> 
>> Sorry, the 3rd link is broken, please refer to this one: Support Python
>> DataStream API
>> <
>> https://docs.google.com/document/d/1H3hz8wuk22-8cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit
>>> 
>> 
>> Shuiqiang Chen <ac...@gmail.com> 于2020年7月8日周三 上午11:13写道：
>> 
>>> Hi everyone,
>>> 
>>> As we all know, Flink provides three layered APIs: the ProcessFunctions,
>>> the DataStream API and the SQL & Table API. Each API offers a different
>>> trade-off between conciseness and expressiveness and targets different
>> use
>>> cases[1].
>>> 
>>> Currently, the SQL & Table API has already been supported in PyFlink. The
>>> API provides relational operations as well as user-defined functions to
>>> provide convenience for users who are familiar with python and relational
>>> programming.
>>> 
>>> Meanwhile, the DataStream API and ProcessFunctions provide more generic
>>> APIs to implement stream processing applications. The ProcessFunctions
>>> expose time and state which are the fundamental building blocks for any
>>> kind of streaming application.
>>> To cover more use cases, we are planning to cover all these APIs in
>>> PyFlink.
>>> 
>>> In this discussion(FLIP-130), we propose to support the Python DataStream
>>> API for the stateless part. For more detail, please refer to the FLIP
>> wiki
>>> page here[2]. If interested in the stateful part, you can also take a
>>> look the design doc here[3] for which we are going to discuss in a
>> separate
>>> FLIP.
>>> 
>>> Any comments will be highly appreciated!
>>> 
>>> [1] https://flink.apache.org/flink-applications.html#layered-apis
>>> [2]
>>> 
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866298
>>> [3]
>>> 
>> https://docs.google.com/document/d/1H3hz8wuk228cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit?usp=sharing
>>> 
>>> Best,
>>> Shuiqiang
>>> 
>>> 
>>> 
>>> 
>>

Re: [DISCUSS] FLIP-130: Support for Python DataStream API (Stateless Part)

Posted by Hequn Cheng <he...@apache.org>.

+1 for adding the Python DataStream API and starting with the stateless
part.
There are already some users that expressed their wish to have the Python
DataStream APIs. Once we have the APIs in PyFlink, we can cover more use
cases for our users.

Best, Hequn

On Wed, Jul 8, 2020 at 11:45 AM Shuiqiang Chen <ac...@gmail.com> wrote:

> Sorry, the 3rd link is broken, please refer to this one: Support Python
> DataStream API
> <
> https://docs.google.com/document/d/1H3hz8wuk22-8cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit
> >
>
> Shuiqiang Chen <ac...@gmail.com> 于2020年7月8日周三 上午11:13写道：
>
> > Hi everyone,
> >
> > As we all know, Flink provides three layered APIs: the ProcessFunctions,
> > the DataStream API and the SQL & Table API. Each API offers a different
> > trade-off between conciseness and expressiveness and targets different
> use
> > cases[1].
> >
> > Currently, the SQL & Table API has already been supported in PyFlink. The
> > API provides relational operations as well as user-defined functions to
> > provide convenience for users who are familiar with python and relational
> > programming.
> >
> > Meanwhile, the DataStream API and ProcessFunctions provide more generic
> > APIs to implement stream processing applications. The ProcessFunctions
> > expose time and state which are the fundamental building blocks for any
> > kind of streaming application.
> > To cover more use cases, we are planning to cover all these APIs in
> > PyFlink.
> >
> > In this discussion(FLIP-130), we propose to support the Python DataStream
> > API for the stateless part. For more detail, please refer to the FLIP
> wiki
> > page here[2]. If interested in the stateful part, you can also take a
> > look the design doc here[3] for which we are going to discuss in a
> separate
> > FLIP.
> >
> > Any comments will be highly appreciated!
> >
> > [1] https://flink.apache.org/flink-applications.html#layered-apis
> > [2]
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866298
> > [3]
> >
> https://docs.google.com/document/d/1H3hz8wuk228cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit?usp=sharing
> >
> > Best,
> > Shuiqiang
> >
> >
> >
> >
>

Re: [DISCUSS] FLIP-130: Support for Python DataStream API (Stateless Part)

Posted by Shuiqiang Chen <ac...@gmail.com>.

Sorry, the 3rd link is broken, please refer to this one: Support Python
DataStream API
<https://docs.google.com/document/d/1H3hz8wuk22-8cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit>

Shuiqiang Chen <ac...@gmail.com> 于2020年7月8日周三 上午11:13写道：

> Hi everyone,
>
> As we all know, Flink provides three layered APIs: the ProcessFunctions,
> the DataStream API and the SQL & Table API. Each API offers a different
> trade-off between conciseness and expressiveness and targets different use
> cases[1].
>
> Currently, the SQL & Table API has already been supported in PyFlink. The
> API provides relational operations as well as user-defined functions to
> provide convenience for users who are familiar with python and relational
> programming.
>
> Meanwhile, the DataStream API and ProcessFunctions provide more generic
> APIs to implement stream processing applications. The ProcessFunctions
> expose time and state which are the fundamental building blocks for any
> kind of streaming application.
> To cover more use cases, we are planning to cover all these APIs in
> PyFlink.
>
> In this discussion(FLIP-130), we propose to support the Python DataStream
> API for the stateless part. For more detail, please refer to the FLIP wiki
> page here[2]. If interested in the stateful part, you can also take a
> look the design doc here[3] for which we are going to discuss in a separate
> FLIP.
>
> Any comments will be highly appreciated!
>
> [1] https://flink.apache.org/flink-applications.html#layered-apis
> [2]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866298
> [3]
> https://docs.google.com/document/d/1H3hz8wuk228cDBhQmQKNw3m1q5gDAMkwTDEwnj3FBI/edit?usp=sharing
>
> Best,
> Shuiqiang
>
>
>
>