Posted to user@flink.apache.org by "Saravanan77@gmail.com" <sa...@gmail.com> on 2022/02/09 16:56:33 UTC

Flink 1.12.x DataSet --> Flink 1.14.x DataStream

I am trying to migrate from the Flink 1.12.x DataSet API to the Flink 1.14.x
DataStream API. mapPartition is not available in the Flink DataStream API.
*Current code using Flink 1.12.x DataSet:*

dataset
    .<few operations>
    .mapPartition(new SomeMapPartitionFn())
    .<few more operations>

public static class SomeMapPartitionFn extends RichMapPartitionFunction<InputModel, OutputModel> {

    @Override
    public void mapPartition(Iterable<InputModel> records, Collector<OutputModel> out) throws Exception {
        for (InputModel record : records) {
            /*
            do some operation
             */
            if (/* some condition based on processing *MULTIPLE* records */) {
                out.collect(...); // Conditional collect                ---> (1)
            }
        }

        // At the end of the data, collect

        out.collect(...);   // Collect processed data                   ---> (2)
    }
}


   - (1) - Collector.collect invoked based on some condition after processing
     a few records
   - (2) - Collector.collect invoked at the end of data

Initially we thought of using flatMap instead of mapPartition, but the
collector is not available in the close function.

https://issues.apache.org/jira/browse/FLINK-14709 - only available in the
case of chained drivers

How to implement this in Flink 1.14.x DataStream? Please advise...

*Note*: Our application works with only a finite set of data (Batch Mode).
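
For context on "Batch Mode" with the DataStream API, here is a minimal sketch
(class name and source are illustrative, not part of the original pipeline,
and it assumes all sources in the job are bounded) of how a 1.14 DataStream
job is pinned to batch execution so that operators see a finite input:

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BatchModeSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // With only bounded sources, BATCH mode runs the job with batch scheduling
        // and blocking shuffles instead of streaming execution.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        env.fromSequence(1, 10).print(); // bounded source, just for illustration

        env.execute("batch-mode-setup");
    }
}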

Re: Flink 1.12.x DataSet --> Flink 1.14.x DataStream

Posted by "Saravanan77@gmail.com" <sa...@gmail.com>.
Thanks Zhipeng.  Working as expected.  Thanks once again.

Saravanan

On Tue, Feb 15, 2022 at 3:23 AM Zhipeng Zhang <zh...@gmail.com>
wrote:

> Hi Saravanan,
>
> One solution could be using a streamOperator to implement `BoundedOneInput`
> interface.
> An example code could be found here [1].
>
> [1]
> https://github.com/apache/flink-ml/blob/56b441d85c3356c0ffedeef9c27969aee5b3ecfc/flink-ml-core/src/main/java/org/apache/flink/ml/common/datastream/DataStreamUtils.java#L75
>
> Saravanan77@gmail.com <sa...@gmail.com> wrote on Tue, Feb 15, 2022 at 02:44:
>
>> Hi Niklas,
>>
>> Thanks for your reply.  Approach [1] works only if operators are chained
>> (in other words, operators executed within the same task).   Since
>> mapPartition operator parallelism is different from previous operator
>> parallelism, it doesn't fall under the same task(or not chained) .
>>
>>
>>
>> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/concepts/flink-architecture/#tasks-and-operator-chains
>> https://issues.apache.org/jira/browse/FLINK-14709
>>
>> Saravanan
>>
>> On Mon, Feb 14, 2022 at 9:01 AM Niklas Semmler <ni...@ververica.com>
>> wrote:
>>
>>> Hi Saravanan,
>>>
>>> AFAIK the last record is not treated differently.
>>>
>>> Does the approach in [1] not work?
>>>
>>> Best regards,
>>> Niklas
>>>
>>>
>>> https://github.com/dmvk/flink/blob/2f1b573cd57e95ecac13c8c57c0356fb281fd753/flink-runtime/src/test/java/org/apache/flink/runtime/operators/chaining/ChainTaskTest.java#L279
>>>
>>>
>>> > On 9. Feb 2022, at 20:31, Saravanan77@gmail.com <sa...@gmail.com>
>>> wrote:
>>> >
>>> > Is there any way to identify the last message inside RichFunction in
>>> BATCH mode ?
>>> >
>>> >
>>> >
>>> > On Wed, Feb 9, 2022 at 8:56 AM Saravanan77@gmail.com <
>>> saravanan77@gmail.com> wrote:
>>> > I am trying to migrate from Flink 1.12.x DataSet api to Flink 1.14.x
>>> DataStream api. mapPartition is not available in Flink DataStream.
>>> >
>>> > Current Code using Flink 1.12.x DataSet :
>>> >
>>> > dataset
>>> >     .<few operations>
>>> >     .mapPartition(new SomeMapParitionFn())
>>> >     .<few more operations>
>>> >
>>> > public static class SomeMapPartitionFn extends
>>> RichMapPartitionFunction<InputModel, OutputModel> {
>>> >
>>> >     @Override
>>> >     public void mapPartition(Iterable<InputModel> records,
>>> Collector<OutputModel> out) throws Exception {
>>> >         for (InputModel record : records) {
>>> >             /*
>>> >             do some operation
>>> >              */
>>> >             if (/* some condition based on processing *MULTIPLE*
>>> records */) {
>>> >
>>> >                 out.collect(...); // Conditional collect
>>>   ---> (1)
>>> >             }
>>> >         }
>>> >
>>> >         // At the end of the data, collect
>>> >
>>> >         out.collect(...);   // Collect processed data
>>>  ---> (2)
>>> >     }
>>> > }
>>> >
>>> >       • (1) - Collector.collect invoked based on some condition after
>>> processing few records
>>> >       • (2) - Collector.collect invoked at the end of data
>>> >
>>> > Initially we thought of using flatMap instead of mapPartition, but the
>>> collector is not available in close function.
>>> >
>>> > https://issues.apache.org/jira/browse/FLINK-14709 - Only available in
>>> case of chained drivers
>>> > How to implement this in Flink 1.14.x DataStream? Please advise...
>>> >
>>> > Note: Our application works with only finite set of data (Batch Mode)
>>> >
>>>
>>>
>
> --
> best,
> Zhipeng
>
>

Re: Flink 1.12.x DataSet --> Flink 1.14.x DataStream

Posted by Zhipeng Zhang <zh...@gmail.com>.
Hi Saravanan,

One solution could be to use a StreamOperator that implements the `BoundedOneInput`
interface.
Example code can be found here [1].

[1]
https://github.com/apache/flink-ml/blob/56b441d85c3356c0ffedeef9c27969aee5b3ecfc/flink-ml-core/src/main/java/org/apache/flink/ml/common/datastream/DataStreamUtils.java#L75
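
For readers following the thread, below is a minimal, self-contained sketch of
that idea. It is not the linked flink-ml implementation; the
SumPerPartitionOperator class, the summing logic and the fromSequence source
are illustrative stand-ins for the original InputModel/OutputModel pipeline.
processElement covers the per-record conditional collect, and endInput, which
the runtime calls once per subtask when the bounded input has been fully
consumed, covers the end-of-data collect.

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
import org.apache.flink.streaming.api.operators.BoundedOneInput;
import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;

public class BoundedMapPartitionExample {

    /**
     * Behaves like DataSet mapPartition: it may emit while consuming records
     * and emits a final record once the bounded input ends.
     */
    public static class SumPerPartitionOperator
            extends AbstractStreamOperator<Long>
            implements OneInputStreamOperator<Long, Long>, BoundedOneInput {

        private long sum;

        @Override
        public void processElement(StreamRecord<Long> element) {
            sum += element.getValue();
            // (1) Conditional collect while processing records.
            if (element.getValue() % 100 == 0) {
                output.collect(new StreamRecord<>(element.getValue()));
            }
        }

        @Override
        public void endInput() {
            // (2) Called once per subtask when the bounded input is exhausted:
            // the place for the "end of data" collect from the DataSet version.
            output.collect(new StreamRecord<>(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRuntimeMode(RuntimeExecutionMode.BATCH); // finite input, batch execution

        env.fromSequence(1, 1000)
           .transform("sum-per-partition", Types.LONG, new SumPerPartitionOperator())
           .print();

        env.execute("BoundedOneInput example");
    }
}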

Saravanan77@gmail.com <sa...@gmail.com> wrote on Tue, Feb 15, 2022 at 02:44:

> Hi Niklas,
>
> Thanks for your reply.  Approach [1] works only if operators are chained
> (in other words, operators executed within the same task).   Since
> mapPartition operator parallelism is different from previous operator
> parallelism, it doesn't fall under the same task(or not chained) .
>
>
>
> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/concepts/flink-architecture/#tasks-and-operator-chains
> https://issues.apache.org/jira/browse/FLINK-14709
>
> Saravanan
>
> On Mon, Feb 14, 2022 at 9:01 AM Niklas Semmler <ni...@ververica.com>
> wrote:
>
>> Hi Saravanan,
>>
>> AFAIK the last record is not treated differently.
>>
>> Does the approach in [1] not work?
>>
>> Best regards,
>> Niklas
>>
>>
>> https://github.com/dmvk/flink/blob/2f1b573cd57e95ecac13c8c57c0356fb281fd753/flink-runtime/src/test/java/org/apache/flink/runtime/operators/chaining/ChainTaskTest.java#L279
>>
>>
>> > On 9. Feb 2022, at 20:31, Saravanan77@gmail.com <sa...@gmail.com>
>> wrote:
>> >
>> > Is there any way to identify the last message inside RichFunction in
>> BATCH mode ?
>> >
>> >
>> >
>> > On Wed, Feb 9, 2022 at 8:56 AM Saravanan77@gmail.com <
>> saravanan77@gmail.com> wrote:
>> > I am trying to migrate from Flink 1.12.x DataSet api to Flink 1.14.x
>> DataStream api. mapPartition is not available in Flink DataStream.
>> >
>> > Current Code using Flink 1.12.x DataSet :
>> >
>> > dataset
>> >     .<few operations>
>> >     .mapPartition(new SomeMapParitionFn())
>> >     .<few more operations>
>> >
>> > public static class SomeMapPartitionFn extends
>> RichMapPartitionFunction<InputModel, OutputModel> {
>> >
>> >     @Override
>> >     public void mapPartition(Iterable<InputModel> records,
>> Collector<OutputModel> out) throws Exception {
>> >         for (InputModel record : records) {
>> >             /*
>> >             do some operation
>> >              */
>> >             if (/* some condition based on processing *MULTIPLE*
>> records */) {
>> >
>> >                 out.collect(...); // Conditional collect
>> ---> (1)
>> >             }
>> >         }
>> >
>> >         // At the end of the data, collect
>> >
>> >         out.collect(...);   // Collect processed data
>>  ---> (2)
>> >     }
>> > }
>> >
>> >       • (1) - Collector.collect invoked based on some condition after
>> processing few records
>> >       • (2) - Collector.collect invoked at the end of data
>> >
>> > Initially we thought of using flatMap instead of mapPartition, but the
>> collector is not available in close function.
>> >
>> > https://issues.apache.org/jira/browse/FLINK-14709 - Only available in
>> case of chained drivers
>> > How to implement this in Flink 1.14.x DataStream? Please advise...
>> >
>> > Note: Our application works with only finite set of data (Batch Mode)
>> >
>>
>>

-- 
best,
Zhipeng

Re: Flink 1.12.x DataSet --> Flink 1.14.x DataStream

Posted by "Saravanan77@gmail.com" <sa...@gmail.com>.
Hi Niklas,

Thanks for your reply.  Approach [1] works only if the operators are chained
(in other words, executed within the same task).  Since the mapPartition
operator's parallelism is different from the previous operator's parallelism,
it does not fall under the same task (i.e., it is not chained).


https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/concepts/flink-architecture/#tasks-and-operator-chains
https://issues.apache.org/jira/browse/FLINK-14709
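
To illustrate the chaining point, here is a minimal, hypothetical sketch (none
of this is the actual pipeline): when two adjacent operators are given
different parallelism, Flink has to redistribute records between them, so they
run in separate tasks instead of being chained into one.

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ChainingIllustration {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        env.fromSequence(1, 100)
           .map(x -> x * 2).returns(Types.LONG).setParallelism(4)
           // The parallelism changes here (4 -> 2), so the two map operators
           // end up in different tasks and are not chained together.
           .map(x -> x + 1).returns(Types.LONG).setParallelism(2)
           .print();

        env.execute("chaining-illustration");
    }
}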

Saravanan

On Mon, Feb 14, 2022 at 9:01 AM Niklas Semmler <ni...@ververica.com> wrote:

> Hi Saravanan,
>
> AFAIK the last record is not treated differently.
>
> Does the approach in [1] not work?
>
> Best regards,
> Niklas
>
>
> https://github.com/dmvk/flink/blob/2f1b573cd57e95ecac13c8c57c0356fb281fd753/flink-runtime/src/test/java/org/apache/flink/runtime/operators/chaining/ChainTaskTest.java#L279
>
>
> > On 9. Feb 2022, at 20:31, Saravanan77@gmail.com <sa...@gmail.com>
> wrote:
> >
> > Is there any way to identify the last message inside RichFunction in
> BATCH mode ?
> >
> >
> >
> > On Wed, Feb 9, 2022 at 8:56 AM Saravanan77@gmail.com <
> saravanan77@gmail.com> wrote:
> > I am trying to migrate from Flink 1.12.x DataSet api to Flink 1.14.x
> DataStream api. mapPartition is not available in Flink DataStream.
> >
> > Current Code using Flink 1.12.x DataSet :
> >
> > dataset
> >     .<few operations>
> >     .mapPartition(new SomeMapParitionFn())
> >     .<few more operations>
> >
> > public static class SomeMapPartitionFn extends
> RichMapPartitionFunction<InputModel, OutputModel> {
> >
> >     @Override
> >     public void mapPartition(Iterable<InputModel> records,
> Collector<OutputModel> out) throws Exception {
> >         for (InputModel record : records) {
> >             /*
> >             do some operation
> >              */
> >             if (/* some condition based on processing *MULTIPLE* records
> */) {
> >
> >                 out.collect(...); // Conditional collect
> ---> (1)
> >             }
> >         }
> >
> >         // At the end of the data, collect
> >
> >         out.collect(...);   // Collect processed data
>  ---> (2)
> >     }
> > }
> >
> >       • (1) - Collector.collect invoked based on some condition after
> processing few records
> >       • (2) - Collector.collect invoked at the end of data
> >
> > Initially we thought of using flatMap instead of mapPartition, but the
> collector is not available in close function.
> >
> > https://issues.apache.org/jira/browse/FLINK-14709 - Only available in
> case of chained drivers
> > How to implement this in Flink 1.14.x DataStream? Please advise...
> >
> > Note: Our application works with only finite set of data (Batch Mode)
> >
>
>

Re: Flink 1.12.x DataSet --> Flink 1.14.x DataStream

Posted by Niklas Semmler <ni...@ververica.com>.
Hi Saravanan,

AFAIK the last record is not treated differently.

Does the approach in [1] not work? 

Best regards,
Niklas

[1] https://github.com/dmvk/flink/blob/2f1b573cd57e95ecac13c8c57c0356fb281fd753/flink-runtime/src/test/java/org/apache/flink/runtime/operators/chaining/ChainTaskTest.java#L279


> On 9. Feb 2022, at 20:31, Saravanan77@gmail.com <sa...@gmail.com> wrote:
> 
> Is there any way to identify the last message inside RichFunction in BATCH mode ?
> 
> 
> 
> On Wed, Feb 9, 2022 at 8:56 AM Saravanan77@gmail.com <sa...@gmail.com> wrote:
> I am trying to migrate from Flink 1.12.x DataSet api to Flink 1.14.x DataStream api. mapPartition is not available in Flink DataStream.
> 
> Current Code using Flink 1.12.x DataSet :
> 
> dataset
>     .<few operations>
>     .mapPartition(new SomeMapParitionFn())
>     .<few more operations>
> 
> public static class SomeMapPartitionFn extends RichMapPartitionFunction<InputModel, OutputModel> {
> 
>     @Override
>     public void mapPartition(Iterable<InputModel> records, Collector<OutputModel> out) throws Exception {
>         for (InputModel record : records) {
>             /*
>             do some operation    
>              */
>             if (/* some condition based on processing *MULTIPLE* records */) {
> 
>                 out.collect(...); // Conditional collect                ---> (1)
>             }
>         }
>         
>         // At the end of the data, collect
> 
>         out.collect(...);   // Collect processed data                   ---> (2) 
>     }
> }
> 
> 	• (1) - Collector.collect invoked based on some condition after processing few records
> 	• (2) - Collector.collect invoked at the end of data
> 
> Initially we thought of using flatMap instead of mapPartition, but the collector is not available in close function.
> 
> https://issues.apache.org/jira/browse/FLINK-14709 - Only available in case of chained drivers
> How to implement this in Flink 1.14.x DataStream? Please advise...
> 
> Note: Our application works with only finite set of data (Batch Mode)
> 


Re: Flink 1.12.x DataSet --> Flink 1.14.x DataStream

Posted by "Saravanan77@gmail.com" <sa...@gmail.com>.
Is there any way to identify the last message inside a RichFunction in BATCH
mode?



On Wed, Feb 9, 2022 at 8:56 AM Saravanan77@gmail.com <sa...@gmail.com>
wrote:

> I am trying to migrate from Flink 1.12.x DataSet api to Flink 1.14.x
> DataStream api. mapPartition is not available in Flink DataStream.
> *Current Code using Flink 1.12.x DataSet :*
>
> dataset
>     .<few operations>
>     .mapPartition(new SomeMapParitionFn())
>     .<few more operations>
>
> public static class SomeMapPartitionFn extends RichMapPartitionFunction<InputModel, OutputModel> {
>
>     @Override
>     public void mapPartition(Iterable<InputModel> records, Collector<OutputModel> out) throws Exception {
>         for (InputModel record : records) {
>             /*
>             do some operation
>              */
>             if (/* some condition based on processing *MULTIPLE* records */) {
>                 out.collect(...); // Conditional collect                ---> (1)
>             }
>         }
>
>         // At the end of the data, collect
>
>         out.collect(...);   // Collect processed data                   ---> (2)
>     }
> }
>
>
>    -
>
>    (1) - Collector.collect invoked based on some condition after
>    processing few records
>    -
>
>    (2) - Collector.collect invoked at the end of data
>
>    Initially we thought of using flatMap instead of mapPartition, but the
>    collector is not available in close function.
>
>    https://issues.apache.org/jira/browse/FLINK-14709 - Only available in
>    case of chained drivers
>
> How to implement this in Flink 1.14.x DataStream? Please advise...
>
> *Note*: Our application works with only finite set of data (Batch Mode)
>
>
>