Posted to user@flink.apache.org by simple <10...@qq.com> on 2023/02/24 11:49:16 UTC

Re: RE: Re: Re: Should we always mark ValueState as "transient" for RichFunctions

Please wait a moment, I will take a look at your feedback.

Sent from my iPhone


------------------ Original ------------------
From: Zhongpu Chen <chenloveit@gmail.com>
Date: Fri, Feb 24, 2023 7:47 PM
To: user <user@flink.apache.org>
Subject: Re: RE: Re: Re: Should we always mark ValueState as "transient" for RichFunctions



Hi Shammon,

Sorry for the inaccurate description in my last reply. Let me restate my
question:

Fact 1: we know that the ValueState here should not be
serialized/deserialized, so it is good practice to mark it as
"transient".

Fact 2: on the other hand, if we don't mark it as "transient", it will
be initialized to null, and this null value will be
serialized/deserialized. I think this will incur some overhead if the
number of partitions is very large.

Given the two facts above, the program produces correct results in both
cases. My question is: is there any performance benchmark from real
(large) applications that compares the two cases?

Feel free to point out if I've misunderstood.
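
To illustrate why both variants work as long as the handle is still unset when the function is serialized, here is a minimal sketch using only plain Java serialization. It assumes the function object is shipped with standard `java.io` serialization; `Handle`, `PlainFunction`, `TransientFunction` and `canSerialize` are hypothetical names made up for this example and are not part of the Flink API. `Handle` merely stands in for a non-Serializable state handle such as ValueState.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class HandleSerializationDemo {

    /** Hypothetical stand-in for a state handle; deliberately NOT Serializable. */
    interface Handle { }

    /** Function-like object whose handle field is not marked transient. */
    static class PlainFunction implements Serializable {
        private static final long serialVersionUID = 1L;
        Handle state; // null until "runtime" assigns it
    }

    /** Same shape, but the handle field is skipped by Java serialization. */
    static class TransientFunction implements Serializable {
        private static final long serialVersionUID = 1L;
        transient Handle state;
    }

    /** Returns true if Java serialization of the object succeeds. */
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            // Thrown when a non-transient field holds a non-Serializable object.
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        PlainFunction plainUnset = new PlainFunction();

        PlainFunction plainSet = new PlainFunction();
        plainSet.state = new Handle() { };

        TransientFunction transientSet = new TransientFunction();
        transientSet.state = new Handle() { };

        System.out.println("non-transient, handle unset: " + canSerialize(plainUnset));
        System.out.println("non-transient, handle set:   " + canSerialize(plainSet));
        System.out.println("transient, handle set:       " + canSerialize(transientSet));
    }
}
```

In this sketch the non-transient variant serializes only while the field is still null, whereas the transient variant serializes regardless of whether the handle has been assigned. That is the robustness argument for `transient`, separate from any performance consideration.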

On 2023/02/24 11:01:51 Shammon FY wrote:
> Hi
>
> Sorry that I don't quite understand your question. I think the above
> functions will only be deserialized when the job is submitted. Do you want
> to test the impact of this on submission throughput?
>
> Best,
> Shammon
>
>
> On Fri, Feb 24, 2023 at 3:04 PM Zhongpu Chen wrote:
>
> > Hi Gen,
> >
> > Thanks for your explanation.
> >
> > Back to this code snippet: since the fields are not marked as "transient"
> > now, I suppose Flink will use Avro to serialize them (null values). Is
> > there any benchmark that compares serializing null values against using
> > "transient"? I mean, it is indeed not good to write them without
> > "transient", but it works. So is there any performance loss here?
> >
> >
> > On 2023/02/24 06:47:21 Gen Luo wrote:
> > > Hi,
> > >
> > > ValueState is a handle rather than an actual value, so it should never
> > > be serialized. In fact, ValueState itself is not Serializable. It
> > > should be OK to always mark it as transient.
> > >
> > > In this case, I suppose it works because the ValueState is not set
> > > (which happens at runtime) when the function is serialized (while
> > > deploying). But it's not good.
> > >
> > > On Fri, Feb 24, 2023 at 10:29 AM Zhongpu Chen wrote:
> > >
> > > > Hi,
> > > >
> > > > When I was reading the code from flink-training-repo [1], I noticed the
> > > > following code:
> > > >
> > > > ```java
> > > >
> > > > public static class EnrichmentFunction
> > > >         extends RichCoFlatMapFunction {
> > > >
> > > >     private ValueState rideState;
> > > >     private ValueState fareState;
> > > >     ...
> > > > }
> > > >
> > > > ```
> > > >
> > > > From my understanding, since the ValueState variables here are scoped
> > > > to each instance, they should not be serialized, for the sake of
> > > > performance. Thus, we should always mark them as "transient". A similar
> > > > discussion can be found here [2].
> > > >
> > > > Should we always mark ValueState as "transient", and why? Please help
> > > > me figure it out.
> > > >
> > > > [1] https://github.com/apache/flink-training/blob/master/rides-and-fares/src/solution/java/org/apache/flink/training/solutions/ridesandfares/RidesAndFaresSolution.java
> > > >
> > > > [2] https://stackoverflow.com/questions/72556202/flink-managed-state-as-transient

Re: RE: Re: Re: Should we always mark ValueState as "transient" for RichFunctions

Posted by Shammon FY <zj...@gmail.com>.
Hi

I think there is no benchmark for this at present.

Best,
Shammon

On Sun, Feb 26, 2023 at 5:49 PM Gen Luo <lu...@gmail.com> wrote:

> Hi,
>
> I suppose there are two things to clarify:
> 1. The function object is serialized only once, when deploying, and
> deserialized once per task during initialization.
> 2. The ValueState itself is only a handle. It is set up with the key of
> each incoming record. At serialization time, it is only a null field.
>
> In conclusion, the cost of serialization should be constant and
> negligible, and there should be no impact on performance whether or
> not the field is transient.
> By the way, preparing, accessing and updating the value state can indeed
> be costly if the number of partitions is really large, but it has nothing
> to do with the `transient` qualifier.

Re: RE: Re: Re: Should we always mark ValueState as "transient" for RichFunctions

Posted by Gen Luo <lu...@gmail.com>.
Hi,

I suppose there are two things to clarify:
1. The function object is serialized only once, when deploying, and
deserialized once per task during initialization.
2. The ValueState itself is only a handle. It is set up with the key of
each incoming record. At serialization time, it is only a null field.

In conclusion, the cost of serialization should be constant and
negligible, and there should be no impact on performance whether or
not the field is transient.
By the way, preparing, accessing and updating the value state can indeed be
costly if the number of partitions is really large, but it has nothing to
do with the `transient` qualifier.
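
The constant-cost point can be checked with a small sketch that compares the serialized size of an object with a null non-transient field against one with a transient field. This assumes plain `java.io` serialization of the function object; `WithPlain`, `WithTrans` and `serializedSize` are hypothetical names for this example only, and an `Object` field stands in for the ValueState handle.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class TransientSizeDemo {

    /** Field is serialized, but stays null because nothing assigns it before shipping. */
    static class WithPlain implements Serializable {
        private static final long serialVersionUID = 1L;
        Object state;
    }

    /** Field is skipped entirely by Java serialization. */
    static class WithTrans implements Serializable {
        private static final long serialVersionUID = 1L;
        transient Object state;
    }

    /** Serializes the object into memory and returns the number of bytes written. */
    static int serializedSize(Object o) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        int plain = serializedSize(new WithPlain());
        int trans = serializedSize(new WithTrans());
        System.out.println("non-transient null field: " + plain + " bytes");
        System.out.println("transient field:          " + trans + " bytes");
        System.out.println("difference:               " + (plain - trans) + " bytes");
    }
}
```

The difference comes from one extra field descriptor plus a single null marker, so it is a small fixed number of bytes paid once per serialized function object. It does not grow with the number of keys or partitions handled at runtime.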
