Posted to dev@flink.apache.org by Dian Fu <di...@gmail.com> on 2021/01/04 12:03:10 UTC

[DISCUSS] FLIP-155: Introduce a few convenient operations in Table API

Hi all,

I'd like to start a discussion about introducing a few convenient operations in Table API from the perspective of ease of use. 

Currently, some tasks are not easy to express in the Table API, e.g. deduplication and topn, and others are not easy to express when a table has hundreds of columns, e.g. null data handling.

I'd like to propose introducing a few operations in the Table API for the following purposes:
- Enable Table API users to easily leverage the powerful features already available in SQL, e.g. deduplication, topn, etc.
- Provide some convenient operations, e.g. a series of operations for null data handling (which becomes a problem when there are hundreds of columns), and data sampling and splitting (a very common use case in ML, which usually needs to split a table into multiple tables for training and validation).
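
To make the splitting use case concrete, here is a minimal plain-Java sketch (no Flink dependency) of a seeded random split of rows into a training part and a validation part. The class `SplitSketch`, the method `randomSplit` and its parameters are hypothetical names chosen for illustration; they are not the API proposed in FLIP-155.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; not the FLIP-155 API.
final class SplitSketch {

    // Puts each row into the first list with probability `trainFraction`,
    // otherwise into the second list. A fixed seed makes the split reproducible.
    static <T> List<List<T>> randomSplit(List<T> rows, double trainFraction, long seed) {
        Random random = new Random(seed);
        List<T> train = new ArrayList<>();
        List<T> validation = new ArrayList<>();
        for (T row : rows) {
            if (random.nextDouble() < trainFraction) {
                train.add(row);
            } else {
                validation.add(row);
            }
        }
        return List.of(train, validation);
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            rows.add(i);
        }
        List<List<Integer>> parts = randomSplit(rows, 0.8, 42L);
        System.out.println("train=" + parts.get(0) + " validation=" + parts.get(1));
    }
}
```

Every row ends up in exactly one of the two parts, and the same seed always yields the same split.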

Please refer to FLIP-155 [1] for more details.

Looking forward to your feedback!

Regards,
Dian

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-155%3A+Introduce+a+few+convenient+operations+in+Table+API

Re: [DISCUSS] FLIP-155: Introduce a few convenient operations in Table API

Posted by Timo Walther <tw...@apache.org>.
Hi Dian,

Thanks for working on improving the Table API. I went through the entire 
FLIP and many functions definitely make sense. However, we need to make 
sure that the general API naming, behavior etc. remains consistent.

Here is some feedback from my side:

1) deduplicate
Are we planning to overload this method in Java or do users always have 
to provide all 3 parameters? I'm asking because I find the `new 
Expression[] {$("a"), $("b")}` not very fluent. It would be better to 
have varargs at the end of the method signature instead.

If this is not possible, maybe we could also think about forcing
`withColumns()`/`withoutColumns()` programmatically at those locations
instead of using arrays. Maybe we can introduce an `ExpressionList` that
is returned by `withColumns()`/`withoutColumns()`.

Are users able to define `asc` or `desc` for the `orderField`?
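
To illustrate the varargs point, here is a small self-contained sketch that contrasts an array-based signature with a varargs-at-the-end signature. `Expr`, `$` and both `deduplicate` overloads are stand-ins written for this example. They are not Flink classes, and the parameter names and order are assumptions rather than the signature from the FLIP.

```java
import java.util.Arrays;
import java.util.List;

// Stand-in types for illustration; NOT Flink's Table/Expression classes.
final class VarargsSketch {

    static final class Expr {
        final String name;

        Expr(String name) {
            this.name = name;
        }
    }

    // Mimics the `$("a")` shorthand used in the thread.
    static Expr $(String name) {
        return new Expr(name);
    }

    // Array-based form: callers have to write `new Expr[] {...}`.
    static String deduplicate(Expr[] keys, Expr orderField, boolean keepFirst) {
        return describe(Arrays.asList(keys), orderField, keepFirst);
    }

    // Varargs-at-the-end form: the keys are simply listed last.
    static String deduplicate(Expr orderField, boolean keepFirst, Expr... keys) {
        return describe(Arrays.asList(keys), orderField, keepFirst);
    }

    private static String describe(List<Expr> keys, Expr orderField, boolean keepFirst) {
        StringBuilder names = new StringBuilder();
        for (Expr key : keys) {
            if (names.length() > 0) {
                names.append(", ");
            }
            names.append(key.name);
        }
        return "keys=[" + names + "], order=" + orderField.name + ", keepFirst=" + keepFirst;
    }

    public static void main(String[] args) {
        System.out.println(deduplicate(new Expr[] {$("a"), $("b")}, $("ts"), true));
        System.out.println(deduplicate($("ts"), true, $("a"), $("b")));
    }
}
```

Both calls describe the same operation; only the second avoids the explicit array construction.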

2) topn

Rename to just `top` such that it reads `top(3)`?

Can't we reuse existing parts of the API for this task and introduce a
`partitionBy` clause: `Table.partitionBy(...).orderBy(...).limit(3)`?

Actually, we could use a similar syntax for deduplicate as well: 
`Table.partitionBy(...).orderBy(...).deduplicate()`
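
The semantics behind the `partitionBy(...).orderBy(...).limit(n)` idea can be sketched in plain Java on an in-memory list. This is only a batch-style illustration under the assumption that "top n" means "the first n rows per partition after sorting"; `TopPerPartitionSketch` and `top` are hypothetical names, and a streaming implementation would look quite different.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration only; not a Flink operator.
final class TopPerPartitionSketch {

    // Groups rows by `partitionKey`, sorts each group with `order`,
    // and keeps at most `n` rows per group.
    static <T, K> Map<K, List<T>> top(
            List<T> rows, Function<T, K> partitionKey, Comparator<T> order, int n) {
        Map<K, List<T>> groups = new LinkedHashMap<>();
        for (T row : rows) {
            groups.computeIfAbsent(partitionKey.apply(row), k -> new ArrayList<>()).add(row);
        }
        for (Map.Entry<K, List<T>> entry : groups.entrySet()) {
            List<T> sorted = new ArrayList<>(entry.getValue());
            sorted.sort(order);
            entry.setValue(new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size()))));
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> rows =
                List.of(Map.entry("a", 3), Map.entry("b", 5), Map.entry("a", 9));
        Comparator<Map.Entry<String, Integer>> descending =
                Map.Entry.<String, Integer>comparingByValue().reversed();
        System.out.println(top(rows, Map.Entry::getKey, descending, 1));
    }
}
```

With `n = 1` the same building blocks express a keep-one-row-per-key deduplication, which is why the two operations could share one syntax.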

3) hint

How can we guarantee the same API for Scala and Java?
`java.util.Map<String, String>` would require Scala users to perform
collection conversions. Can we introduce a fluent way to unify
the two APIs?

For example, add a dedicated method for all kinds of hints?
```
   table
     .hintOption(String key, String value)
     .hintOption(String key, String value)
     .hintOption(String key, String value)
```
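
A minimal sketch of how such a fluent method could accumulate options without exposing a `java.util.Map` in the method signature. `HintedTableSketch` is a hypothetical stand-in, not Flink's `Table`, and the option keys used below are made up.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a table handle; not Flink's Table class.
final class HintedTableSketch {
    private final Map<String, String> hints;

    private HintedTableSketch(Map<String, String> hints) {
        this.hints = hints;
    }

    static HintedTableSketch create() {
        return new HintedTableSketch(new LinkedHashMap<>());
    }

    // Returns a new instance so that chained calls never mutate the receiver.
    HintedTableSketch hintOption(String key, String value) {
        Map<String, String> copy = new LinkedHashMap<>(hints);
        copy.put(key, value);
        return new HintedTableSketch(copy);
    }

    // Read-only view of the collected options.
    Map<String, String> hints() {
        return Collections.unmodifiableMap(hints);
    }

    public static void main(String[] args) {
        HintedTableSketch table = create()
                .hintOption("example.key-1", "v1")  // made-up option keys
                .hintOption("example.key-2", "v2");
        System.out.println(table.hints());
    }
}
```

Because the method takes only `String` arguments, a Scala caller would use exactly the same call chain as a Java caller.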

4) fillna

I don't find this name intuitive, and it doesn't match the other
methods of the API.

How about `replaceNull()`?

In general, I'm wondering if we should rather introduce a lambda-like
function that would serve a variety of use cases:

Just an initial example:
```
table.mapColumns(e -> e.ifNull(1))
table.mapColumns(e -> e.ifNull(1), ExpressionList)
```
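
To show what such a column-wise lambda would do, here is a plain-Java sketch that models a row as a `Map<String, Object>`. `MapColumnsSketch`, `mapColumns` and `ifNull` are hypothetical helpers for this example; a `Set<String>` of column names plays the role of the `ExpressionList` argument.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical illustration only; a row is modeled as column name -> value.
final class MapColumnsSketch {

    // Applies `fn` to every column of the row.
    static Map<String, Object> mapColumns(Map<String, Object> row, UnaryOperator<Object> fn) {
        return mapColumns(row, fn, row.keySet());
    }

    // Applies `fn` only to the listed columns; other columns pass through unchanged.
    static Map<String, Object> mapColumns(
            Map<String, Object> row, UnaryOperator<Object> fn, Set<String> columns) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : row.entrySet()) {
            Object value = entry.getValue();
            result.put(entry.getKey(), columns.contains(entry.getKey()) ? fn.apply(value) : value);
        }
        return result;
    }

    // Plays the role of `e.ifNull(1)` from the example above.
    static UnaryOperator<Object> ifNull(Object replacement) {
        return value -> value == null ? replacement : value;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("a", null);
        row.put("b", 7);
        System.out.println(mapColumns(row, ifNull(1)));
    }
}
```

The same building block would cover null replacement, value replacement and similar per-column rewrites without a dedicated top-level method for each.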

5) dropna

Is this really useful? This sounds like a rarely used method.

6) replace

Similar to other proposed methods, we will have issues with the Scala 
API when using a java.util.Map.

Furthermore, this map should also take expressions instead of objects.


Let me know what you think.

Regards,
Timo




On 06.01.21 05:00, Dian Fu wrote:
> Hi all,
> 
> I have updated the FLIP about temporal join, sql hints and window TVF.
> 
> Regards,
> Dian


Re: [DISCUSS] FLIP-155: Introduce a few convenient operations in Table API

Posted by Dian Fu <di...@gmail.com>.
Hi all,

I have updated the FLIP about temporal join, sql hints and window TVF.

Regards,
Dian

> On Jan 5, 2021, at 11:58 AM, Dian Fu <di...@gmail.com> wrote:
> 
> Thanks a lot for your comments!
> 
> Regarding Python Table API examples: I thought it would be straightforward to use these operations in the Python Table API and so had not added them. However, the suggestion makes sense to me and I have added some examples of how to use them in the Python Table API to make it clearer.
> 
> Regarding dropDuplicates vs deduplicate: +1 to use deduplicate. It's more consistent with the feature/concept which is already documented clearly in Flink.
> 
> Regarding `myTable.coalesce($("a"), 1).as("a")`: I'm still in favor of fillna for now. Compared to coalesce, fillna can handle multiple columns in one method call. As for the naming convention, the names "fillna/dropna/replace" come from Pandas [1][2][3].
> 
> Regarding `event-time/processing-time temporal join, SQL Hints, window TVF`: Good catch! We should definitely support them in the Table API. I will update the FLIP to cover these functionalities.
> 
> [1] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html
> [2] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html
> [3] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html


Re: [DISCUSS] FLIP-155: Introduce a few convenient operations in Table API

Posted by Dian Fu <di...@gmail.com>.
Thanks a lot for your comments!

Regarding Python Table API examples: I thought it would be straightforward to use these operations in the Python Table API and so had not added them. However, the suggestion makes sense to me and I have added some examples of how to use them in the Python Table API to make it clearer.

Regarding dropDuplicates vs deduplicate: +1 to use deduplicate. It's more consistent with the feature/concept which is already documented clearly in Flink.

Regarding `myTable.coalesce($("a"), 1).as("a")`: I'm still in favor of fillna for now. Compared to coalesce, fillna can handle multiple columns in one method call. As for the naming convention, the names "fillna/dropna/replace" come from Pandas [1][2][3].

Regarding `event-time/processing-time temporal join, SQL Hints, window TVF`: Good catch! We should definitely support them in the Table API. I will update the FLIP to cover these functionalities.

[1] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html
[2] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html
[3] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html
> On Jan 4, 2021, at 10:59 PM, Timo Walther <tw...@apache.org> wrote:
> 
> Hi Dian,
> 
> thanks for the proposed FLIP. I haven't taken a deep look at the proposal yet but will do so shortly. In general, we should aim to make the Table API as concise and self-explaining as possible. E.g. `dropna` does not sound obvious to me.
> 
> Regarding `myTable.coalesce($("a"), 1).as("a")`: Instead of introducing more top-level functions, maybe we should also consider introducing more building blocks, e.g. for applying an expression to every column. A more functional approach (e.g. with a lambda function) could solve more use cases.
> 
> Regards,
> Timo


Re: [DISCUSS] FLIP-155: Introduce a few convenient operations in Table API

Posted by Jark Wu <im...@gmail.com>.
Thanks Dian,

+1 to `deduplicate`.

Regarding `myTable.coalesce($("a"), 1).as("a")`, I'm afraid it may
conflict with or be confused with the built-in expression `coalesce(f0, 0)`
(which we may introduce in the future).

Besides that, could we also align with other features of Flink SQL, e.g.
event-time/processing-time temporal join, SQL Hints, window TVF (FLIP-145
[1])?

Best,
Jark

[1]:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function





On Mon, 4 Jan 2021 at 22:59, Timo Walther <tw...@apache.org> wrote:

> Hi Dian,
>
> thanks for the proposed FLIP. I haven't taken a deep look at the
> proposal yet but will do so shortly. In general, we should aim to make
> the Table API as concise and self-explaining as possible. E.g. `dropna`
> does not sound obvious to me.
>
> Regarding `myTable.coalesce($("a"), 1).as("a")`: Instead of introducing
> more top-level functions, maybe we should also consider introducing more
> building blocks e.g. for applying an expression to every column. A more
> functional approach (e.g. with a lambda function) could solve more use cases.
>
> Regards,
> Timo

Re: [DISCUSS] FLIP-155: Introduce a few convenient operations in Table API

Posted by Timo Walther <tw...@apache.org>.
Hi Dian,

thanks for the proposed FLIP. I haven't taken a deep look at the 
proposal yet but will do so shortly. In general, we should aim to make 
the Table API as concise and self-explaining as possible. E.g. `dropna` 
does not sound obvious to me.

Regarding `myTable.coalesce($("a"), 1).as("a")`: Instead of introducing
more top-level functions, maybe we should also consider introducing more
building blocks, e.g. for applying an expression to every column. A more
functional approach (e.g. with a lambda function) could solve more use cases.

Regards,
Timo

On 04.01.21 15:35, Seth Wiesman wrote:
> This makes sense. I have some questions about method names.
> 
> What do you think about renaming `dropDuplicates` to `deduplicate`? I don't
> think that drop is the right word to use for this operation; it implies
> records are filtered, whereas this operator actually issues updates and
> retractions. Also, deduplicate is already how we talk about this feature in
> the docs, so I think it would be easier for users to find.
> 
> For null handling, I don't know how close we want to stick with SQL
> conventions but what about making `coalesce` a top-level method? Something
> like:
> 
> myTable.coalesce($("a"), 1).as("a")
> 
> We can require the next method to be an `as`. There is already precedent
> for this sort of thing, `GroupedTable#aggregate` can only be followed by
> `select`.
> 
> Seth


Re: [DISCUSS] FLIP-155: Introduce a few convenient operations in Table API

Posted by Seth Wiesman <sj...@gmail.com>.
This makes sense. I have some questions about method names.

What do you think about renaming `dropDuplicates` to `deduplicate`? I don't
think that drop is the right word to use for this operation; it implies
records are filtered, whereas this operator actually issues updates and
retractions. Also, deduplicate is already how we talk about this feature in
the docs, so I think it would be easier for users to find.
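
To illustrate the difference between filtering and issuing updates and retractions, here is a simplified plain-Java sketch of a keep-last-row deduplication that records the changelog a downstream consumer would see. It is not Flink's operator; `DedupChangelogSketch` and `keepLastRow` are hypothetical names, and the `+I`/`-U`/`+U` prefixes follow the short notation Flink uses for insert, update-before and update-after rows.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified illustration only; not Flink's deduplication operator.
final class DedupChangelogSketch {

    // Keeps the latest value per key and returns the emitted changelog:
    // "+I" for a first insert, "-U" to retract the old value, "+U" for the new value.
    static List<String> keepLastRow(List<String[]> keyValuePairs) {
        Map<String, String> latest = new HashMap<>();
        List<String> changelog = new ArrayList<>();
        for (String[] pair : keyValuePairs) {
            String key = pair[0];
            String value = pair[1];
            String previous = latest.put(key, value);
            if (previous == null) {
                changelog.add("+I(" + key + "," + value + ")");
            } else {
                changelog.add("-U(" + key + "," + previous + ")");
                changelog.add("+U(" + key + "," + value + ")");
            }
        }
        return changelog;
    }

    public static void main(String[] args) {
        List<String[]> input = List.of(
                new String[] {"k1", "a"}, new String[] {"k2", "b"}, new String[] {"k1", "c"});
        System.out.println(keepLastRow(input));
    }
}
```

Three input rows produce four changelog records here, which is why "drop" undersells what the operation does.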

For null handling, I don't know how close we want to stick with SQL
conventions but what about making `coalesce` a top-level method? Something
like:

myTable.coalesce($("a"), 1).as("a")

We can require the next method to be an `as`. There is already precedent
for this sort of thing, `GroupedTable#aggregate` can only be followed by
`select`.
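
One way to enforce "the next method must be `as`" at compile time is to return an intermediate type that exposes only that method. The sketch below uses hypothetical stand-in classes (`Table`, `CoalescedTable`) that merely track a textual plan; it is not Flink code.

```java
// Hypothetical illustration of the "must be followed by `as`" idea.
final class CoalesceSketch {

    // Stand-in for a table; only tracks a textual description of the plan.
    static final class Table {
        final String plan;

        Table(String plan) {
            this.plan = plan;
        }

        // Returns an intermediate type instead of a Table, so the only
        // method a caller can chain next is `as`.
        CoalescedTable coalesce(String column, Object defaultValue) {
            return new CoalescedTable(this, column, defaultValue);
        }
    }

    // Intermediate result that offers nothing but `as`.
    static final class CoalescedTable {
        private final Table source;
        private final String column;
        private final Object defaultValue;

        CoalescedTable(Table source, String column, Object defaultValue) {
            this.source = source;
            this.column = column;
            this.defaultValue = defaultValue;
        }

        Table as(String alias) {
            return new Table(
                    source.plan + " -> coalesce(" + column + ", " + defaultValue + ") as " + alias);
        }
    }

    public static void main(String[] args) {
        Table result = new Table("myTable").coalesce("a", 1).as("a");
        System.out.println(result.plan);
    }
}
```

Calling any other table method directly on the result of `coalesce` would fail to compile, which is the same kind of restriction described for `GroupedTable#aggregate` and `select`.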

Seth

On Mon, Jan 4, 2021 at 6:27 AM Wei Zhong <we...@gmail.com> wrote:

> Hi Dian,
>
> Big +1 for making the Table API easier to use. Java users and Python users
> can both benefit from it. I think it would be better if we add some Python
> API examples.
>
> Best,
> Wei

Re: [DISCUSS] FLIP-155: Introduce a few convenient operations in Table API

Posted by Wei Zhong <we...@gmail.com>.
Hi Dian,

Big +1 for making the Table API easier to use. Java users and Python users can both benefit from it. I think it would be better if we add some Python API examples. 

Best,
Wei


> On Jan 4, 2021, at 20:03, Dian Fu <di...@gmail.com> wrote:
> 
> Hi all,
> 
> I'd like to start a discussion about introducing a few convenient operations in Table API from the perspective of ease of use.
> 
> Currently, some tasks are not easy to express in the Table API, e.g. deduplication and topn, and others are not easy to express when a table has hundreds of columns, e.g. null data handling.
> 
> I'd like to propose introducing a few operations in the Table API for the following purposes:
> - Enable Table API users to easily leverage the powerful features already available in SQL, e.g. deduplication, topn, etc.
> - Provide some convenient operations, e.g. a series of operations for null data handling (which becomes a problem when there are hundreds of columns), and data sampling and splitting (a very common use case in ML, which usually needs to split a table into multiple tables for training and validation).
> 
> Please refer to FLIP-155 [1] for more details.
> 
> Looking forward to your feedback!
> 
> Regards,
> Dian
> 
> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-155%3A+Introduce+a+few+convenient+operations+in+Table+API