You are viewing a plain text version of this content. The canonical link for it is here.

Posted to yarn-dev@hadoop.apache.org by Gyori Andras <ga...@cloudera.com.INVALID> on 2022/02/28 15:55:23 UTC

[DISCUSS] Race condition in ProtobufRpcEngine2

Hey everyone!

We have started seeing test failures in YARN PRs for a while. We have
identified the problematic commit, which is HADOOP-18082
<https://issues.apache.org/jira/browse/HADOOP-18082>, however, this change
just revealed the race condition lying in ProtobufRpcEngine2 introduced in
HADOOP-17046 <https://issues.apache.org/jira/browse/HADOOP-17046>. We have
also fixed the underlying issue via a locking mechanism, presented in
HADOOP-18143 <https://issues.apache.org/jira/browse/HADOOP-18143>, but
since it is out of our area of expertise, we can neither verify nor
guarantee that it will not cause some subtle issues in the RPC system.
As we think it is a core part of Hadoop, we would use feedback from someone
who is proficient in this part.

Regards:
Andras

Re: [DISCUSS] Race condition in ProtobufRpcEngine2

Posted by Vinayakumar B <vi...@apache.org>.

Call.toString() was used only while sending the response, in the response
processor thread. At this point, header would be already read and
processed. So no chance of race conditions.

This race condition was not introduced by the HADOOP-17046
<https://issues.apache.org/jira/browse/HADOOP-17046>, instead existing code
was never intended to be used for logging purposes by multiple threads.
Though the locking mechanism proposed in HADOOP-18143
<https://issues.apache.org/jira/browse/HADOOP-18143> solves the problems,
it adds overhead in both memory as well as performance.

This is a critical path for RPC processing, and adding a
locking/synchronization in this place would affect the overall performance
(may not be huge).

I would advise to remove logging and locking both altogether and leave the
code as is in Reader thread.

If logging is required, then add it in the response processor thread or the
handler threads.

-Vinay

On Mon, Feb 28, 2022 at 10:02 PM Ayush Saxena <ay...@gmail.com> wrote:

> Hey Andras,
> I had a quick look at HADOOP-18143, the methods in question in
> ProtobufRpcEngine2 are identical to the ones in ProtobufRpcEngine. So, I am
> not very sure how the  race condition doesn't happen in  ProtobufRpcEngine.
> I have to debug and spend some more time, considering that I have
> reverted HADOOP-18082 for now to unblock YARN. Though the issue would still
> be there as you said, but will give us some time to analyse.
>
> Thanks
> -Ayush
>
> On Mon, 28 Feb 2022 at 21:26, Gyori Andras <ga...@cloudera.com.invalid>
> wrote:
>
>> Hey everyone!
>>
>> We have started seeing test failures in YARN PRs for a while. We have
>> identified the problematic commit, which is HADOOP-18082
>> <https://issues.apache.org/jira/browse/HADOOP-18082>, however, this
>> change
>> just revealed the race condition lying in ProtobufRpcEngine2 introduced in
>> HADOOP-17046 <https://issues.apache.org/jira/browse/HADOOP-17046>. We
>> have
>> also fixed the underlying issue via a locking mechanism, presented in
>> HADOOP-18143 <https://issues.apache.org/jira/browse/HADOOP-18143>, but
>> since it is out of our area of expertise, we can neither verify nor
>> guarantee that it will not cause some subtle issues in the RPC system.
>> As we think it is a core part of Hadoop, we would use feedback from
>> someone
>> who is proficient in this part.
>>
>> Regards:
>> Andras
>>
>

Re: [DISCUSS] Race condition in ProtobufRpcEngine2

Posted by Vinayakumar B <vi...@apache.org>.

Call.toString() was used only while sending the response, in the response
processor thread. At this point, header would be already read and
processed. So no chance of race conditions.

This race condition was not introduced by the HADOOP-17046
<https://issues.apache.org/jira/browse/HADOOP-17046>, instead existing code
was never intended to be used for logging purposes by multiple threads.
Though the locking mechanism proposed in HADOOP-18143
<https://issues.apache.org/jira/browse/HADOOP-18143> solves the problems,
it adds overhead in both memory as well as performance.

This is a critical path for RPC processing, and adding a
locking/synchronization in this place would affect the overall performance
(may not be huge).

I would advise to remove logging and locking both altogether and leave the
code as is in Reader thread.

If logging is required, then add it in the response processor thread or the
handler threads.

-Vinay

On Mon, Feb 28, 2022 at 10:02 PM Ayush Saxena <ay...@gmail.com> wrote:

> Hey Andras,
> I had a quick look at HADOOP-18143, the methods in question in
> ProtobufRpcEngine2 are identical to the ones in ProtobufRpcEngine. So, I am
> not very sure how the  race condition doesn't happen in  ProtobufRpcEngine.
> I have to debug and spend some more time, considering that I have
> reverted HADOOP-18082 for now to unblock YARN. Though the issue would still
> be there as you said, but will give us some time to analyse.
>
> Thanks
> -Ayush
>
> On Mon, 28 Feb 2022 at 21:26, Gyori Andras <ga...@cloudera.com.invalid>
> wrote:
>
>> Hey everyone!
>>
>> We have started seeing test failures in YARN PRs for a while. We have
>> identified the problematic commit, which is HADOOP-18082
>> <https://issues.apache.org/jira/browse/HADOOP-18082>, however, this
>> change
>> just revealed the race condition lying in ProtobufRpcEngine2 introduced in
>> HADOOP-17046 <https://issues.apache.org/jira/browse/HADOOP-17046>. We
>> have
>> also fixed the underlying issue via a locking mechanism, presented in
>> HADOOP-18143 <https://issues.apache.org/jira/browse/HADOOP-18143>, but
>> since it is out of our area of expertise, we can neither verify nor
>> guarantee that it will not cause some subtle issues in the RPC system.
>> As we think it is a core part of Hadoop, we would use feedback from
>> someone
>> who is proficient in this part.
>>
>> Regards:
>> Andras
>>
>

Re: [DISCUSS] Race condition in ProtobufRpcEngine2

Posted by Vinayakumar B <vi...@apache.org>.

Call.toString() was used only while sending the response, in the response
processor thread. At this point, header would be already read and
processed. So no chance of race conditions.

This race condition was not introduced by the HADOOP-17046
<https://issues.apache.org/jira/browse/HADOOP-17046>, instead existing code
was never intended to be used for logging purposes by multiple threads.
Though the locking mechanism proposed in HADOOP-18143
<https://issues.apache.org/jira/browse/HADOOP-18143> solves the problems,
it adds overhead in both memory as well as performance.

This is a critical path for RPC processing, and adding a
locking/synchronization in this place would affect the overall performance
(may not be huge).

I would advise to remove logging and locking both altogether and leave the
code as is in Reader thread.

If logging is required, then add it in the response processor thread or the
handler threads.

-Vinay

On Mon, Feb 28, 2022 at 10:02 PM Ayush Saxena <ay...@gmail.com> wrote:

> Hey Andras,
> I had a quick look at HADOOP-18143, the methods in question in
> ProtobufRpcEngine2 are identical to the ones in ProtobufRpcEngine. So, I am
> not very sure how the  race condition doesn't happen in  ProtobufRpcEngine.
> I have to debug and spend some more time, considering that I have
> reverted HADOOP-18082 for now to unblock YARN. Though the issue would still
> be there as you said, but will give us some time to analyse.
>
> Thanks
> -Ayush
>
> On Mon, 28 Feb 2022 at 21:26, Gyori Andras <ga...@cloudera.com.invalid>
> wrote:
>
>> Hey everyone!
>>
>> We have started seeing test failures in YARN PRs for a while. We have
>> identified the problematic commit, which is HADOOP-18082
>> <https://issues.apache.org/jira/browse/HADOOP-18082>, however, this
>> change
>> just revealed the race condition lying in ProtobufRpcEngine2 introduced in
>> HADOOP-17046 <https://issues.apache.org/jira/browse/HADOOP-17046>. We
>> have
>> also fixed the underlying issue via a locking mechanism, presented in
>> HADOOP-18143 <https://issues.apache.org/jira/browse/HADOOP-18143>, but
>> since it is out of our area of expertise, we can neither verify nor
>> guarantee that it will not cause some subtle issues in the RPC system.
>> As we think it is a core part of Hadoop, we would use feedback from
>> someone
>> who is proficient in this part.
>>
>> Regards:
>> Andras
>>
>

Re: [DISCUSS] Race condition in ProtobufRpcEngine2

Posted by Vinayakumar B <vi...@apache.org>.

Call.toString() was used only while sending the response, in the response
processor thread. At this point, header would be already read and
processed. So no chance of race conditions.

This race condition was not introduced by the HADOOP-17046
<https://issues.apache.org/jira/browse/HADOOP-17046>, instead existing code
was never intended to be used for logging purposes by multiple threads.
Though the locking mechanism proposed in HADOOP-18143
<https://issues.apache.org/jira/browse/HADOOP-18143> solves the problems,
it adds overhead in both memory as well as performance.

This is a critical path for RPC processing, and adding a
locking/synchronization in this place would affect the overall performance
(may not be huge).

I would advise to remove logging and locking both altogether and leave the
code as is in Reader thread.

If logging is required, then add it in the response processor thread or the
handler threads.

-Vinay

On Mon, Feb 28, 2022 at 10:02 PM Ayush Saxena <ay...@gmail.com> wrote:

> Hey Andras,
> I had a quick look at HADOOP-18143, the methods in question in
> ProtobufRpcEngine2 are identical to the ones in ProtobufRpcEngine. So, I am
> not very sure how the  race condition doesn't happen in  ProtobufRpcEngine.
> I have to debug and spend some more time, considering that I have
> reverted HADOOP-18082 for now to unblock YARN. Though the issue would still
> be there as you said, but will give us some time to analyse.
>
> Thanks
> -Ayush
>
> On Mon, 28 Feb 2022 at 21:26, Gyori Andras <ga...@cloudera.com.invalid>
> wrote:
>
>> Hey everyone!
>>
>> We have started seeing test failures in YARN PRs for a while. We have
>> identified the problematic commit, which is HADOOP-18082
>> <https://issues.apache.org/jira/browse/HADOOP-18082>, however, this
>> change
>> just revealed the race condition lying in ProtobufRpcEngine2 introduced in
>> HADOOP-17046 <https://issues.apache.org/jira/browse/HADOOP-17046>. We
>> have
>> also fixed the underlying issue via a locking mechanism, presented in
>> HADOOP-18143 <https://issues.apache.org/jira/browse/HADOOP-18143>, but
>> since it is out of our area of expertise, we can neither verify nor
>> guarantee that it will not cause some subtle issues in the RPC system.
>> As we think it is a core part of Hadoop, we would use feedback from
>> someone
>> who is proficient in this part.
>>
>> Regards:
>> Andras
>>
>

Re: [DISCUSS] Race condition in ProtobufRpcEngine2

Posted by Vinayakumar B <vi...@apache.org>.

Call.toString() was used only while sending the response, in the response
processor thread. At this point, header would be already read and
processed. So no chance of race conditions.

This race condition was not introduced by the HADOOP-17046
<https://issues.apache.org/jira/browse/HADOOP-17046>, instead existing code
was never intended to be used for logging purposes by multiple threads.
Though the locking mechanism proposed in HADOOP-18143
<https://issues.apache.org/jira/browse/HADOOP-18143> solves the problems,
it adds overhead in both memory as well as performance.

This is a critical path for RPC processing, and adding a
locking/synchronization in this place would affect the overall performance
(may not be huge).

I would advise to remove logging and locking both altogether and leave the
code as is in Reader thread.

If logging is required, then add it in the response processor thread or the
handler threads.

-Vinay

On Mon, Feb 28, 2022 at 10:02 PM Ayush Saxena <ay...@gmail.com> wrote:

> Hey Andras,
> I had a quick look at HADOOP-18143, the methods in question in
> ProtobufRpcEngine2 are identical to the ones in ProtobufRpcEngine. So, I am
> not very sure how the  race condition doesn't happen in  ProtobufRpcEngine.
> I have to debug and spend some more time, considering that I have
> reverted HADOOP-18082 for now to unblock YARN. Though the issue would still
> be there as you said, but will give us some time to analyse.
>
> Thanks
> -Ayush
>
> On Mon, 28 Feb 2022 at 21:26, Gyori Andras <ga...@cloudera.com.invalid>
> wrote:
>
>> Hey everyone!
>>
>> We have started seeing test failures in YARN PRs for a while. We have
>> identified the problematic commit, which is HADOOP-18082
>> <https://issues.apache.org/jira/browse/HADOOP-18082>, however, this
>> change
>> just revealed the race condition lying in ProtobufRpcEngine2 introduced in
>> HADOOP-17046 <https://issues.apache.org/jira/browse/HADOOP-17046>. We
>> have
>> also fixed the underlying issue via a locking mechanism, presented in
>> HADOOP-18143 <https://issues.apache.org/jira/browse/HADOOP-18143>, but
>> since it is out of our area of expertise, we can neither verify nor
>> guarantee that it will not cause some subtle issues in the RPC system.
>> As we think it is a core part of Hadoop, we would use feedback from
>> someone
>> who is proficient in this part.
>>
>> Regards:
>> Andras
>>
>

Re: [DISCUSS] Race condition in ProtobufRpcEngine2

Posted by Ayush Saxena <ay...@gmail.com>.

Hey Andras,
I had a quick look at HADOOP-18143, the methods in question in
ProtobufRpcEngine2 are identical to the ones in ProtobufRpcEngine. So, I am
not very sure how the  race condition doesn't happen in  ProtobufRpcEngine.
I have to debug and spend some more time, considering that I have
reverted HADOOP-18082 for now to unblock YARN. Though the issue would still
be there as you said, but will give us some time to analyse.

Thanks
-Ayush

On Mon, 28 Feb 2022 at 21:26, Gyori Andras <ga...@cloudera.com.invalid>
wrote:

> Hey everyone!
>
> We have started seeing test failures in YARN PRs for a while. We have
> identified the problematic commit, which is HADOOP-18082
> <https://issues.apache.org/jira/browse/HADOOP-18082>, however, this change
> just revealed the race condition lying in ProtobufRpcEngine2 introduced in
> HADOOP-17046 <https://issues.apache.org/jira/browse/HADOOP-17046>. We have
> also fixed the underlying issue via a locking mechanism, presented in
> HADOOP-18143 <https://issues.apache.org/jira/browse/HADOOP-18143>, but
> since it is out of our area of expertise, we can neither verify nor
> guarantee that it will not cause some subtle issues in the RPC system.
> As we think it is a core part of Hadoop, we would use feedback from someone
> who is proficient in this part.
>
> Regards:
> Andras
>

Re: [DISCUSS] Race condition in ProtobufRpcEngine2

Posted by Ayush Saxena <ay...@gmail.com>.

Hey Andras,
I had a quick look at HADOOP-18143, the methods in question in
ProtobufRpcEngine2 are identical to the ones in ProtobufRpcEngine. So, I am
not very sure how the  race condition doesn't happen in  ProtobufRpcEngine.
I have to debug and spend some more time, considering that I have
reverted HADOOP-18082 for now to unblock YARN. Though the issue would still
be there as you said, but will give us some time to analyse.

Thanks
-Ayush

On Mon, 28 Feb 2022 at 21:26, Gyori Andras <ga...@cloudera.com.invalid>
wrote:

> Hey everyone!
>
> We have started seeing test failures in YARN PRs for a while. We have
> identified the problematic commit, which is HADOOP-18082
> <https://issues.apache.org/jira/browse/HADOOP-18082>, however, this change
> just revealed the race condition lying in ProtobufRpcEngine2 introduced in
> HADOOP-17046 <https://issues.apache.org/jira/browse/HADOOP-17046>. We have
> also fixed the underlying issue via a locking mechanism, presented in
> HADOOP-18143 <https://issues.apache.org/jira/browse/HADOOP-18143>, but
> since it is out of our area of expertise, we can neither verify nor
> guarantee that it will not cause some subtle issues in the RPC system.
> As we think it is a core part of Hadoop, we would use feedback from someone
> who is proficient in this part.
>
> Regards:
> Andras
>

Re: [DISCUSS] Race condition in ProtobufRpcEngine2

Posted by Ayush Saxena <ay...@gmail.com>.

Hey Andras,
I had a quick look at HADOOP-18143, the methods in question in
ProtobufRpcEngine2 are identical to the ones in ProtobufRpcEngine. So, I am
not very sure how the  race condition doesn't happen in  ProtobufRpcEngine.
I have to debug and spend some more time, considering that I have
reverted HADOOP-18082 for now to unblock YARN. Though the issue would still
be there as you said, but will give us some time to analyse.

Thanks
-Ayush

On Mon, 28 Feb 2022 at 21:26, Gyori Andras <ga...@cloudera.com.invalid>
wrote:

> Hey everyone!
>
> We have started seeing test failures in YARN PRs for a while. We have
> identified the problematic commit, which is HADOOP-18082
> <https://issues.apache.org/jira/browse/HADOOP-18082>, however, this change
> just revealed the race condition lying in ProtobufRpcEngine2 introduced in
> HADOOP-17046 <https://issues.apache.org/jira/browse/HADOOP-17046>. We have
> also fixed the underlying issue via a locking mechanism, presented in
> HADOOP-18143 <https://issues.apache.org/jira/browse/HADOOP-18143>, but
> since it is out of our area of expertise, we can neither verify nor
> guarantee that it will not cause some subtle issues in the RPC system.
> As we think it is a core part of Hadoop, we would use feedback from someone
> who is proficient in this part.
>
> Regards:
> Andras
>

Re: [DISCUSS] Race condition in ProtobufRpcEngine2

Posted by Ayush Saxena <ay...@gmail.com>.

Hey Andras,
I had a quick look at HADOOP-18143, the methods in question in
ProtobufRpcEngine2 are identical to the ones in ProtobufRpcEngine. So, I am
not very sure how the  race condition doesn't happen in  ProtobufRpcEngine.
I have to debug and spend some more time, considering that I have
reverted HADOOP-18082 for now to unblock YARN. Though the issue would still
be there as you said, but will give us some time to analyse.

Thanks
-Ayush

On Mon, 28 Feb 2022 at 21:26, Gyori Andras <ga...@cloudera.com.invalid>
wrote:

> Hey everyone!
>
> We have started seeing test failures in YARN PRs for a while. We have
> identified the problematic commit, which is HADOOP-18082
> <https://issues.apache.org/jira/browse/HADOOP-18082>, however, this change
> just revealed the race condition lying in ProtobufRpcEngine2 introduced in
> HADOOP-17046 <https://issues.apache.org/jira/browse/HADOOP-17046>. We have
> also fixed the underlying issue via a locking mechanism, presented in
> HADOOP-18143 <https://issues.apache.org/jira/browse/HADOOP-18143>, but
> since it is out of our area of expertise, we can neither verify nor
> guarantee that it will not cause some subtle issues in the RPC system.
> As we think it is a core part of Hadoop, we would use feedback from someone
> who is proficient in this part.
>
> Regards:
> Andras
>

Re: [DISCUSS] Race condition in ProtobufRpcEngine2

Posted by Ayush Saxena <ay...@gmail.com>.

Hey Andras,
I had a quick look at HADOOP-18143, the methods in question in
ProtobufRpcEngine2 are identical to the ones in ProtobufRpcEngine. So, I am
not very sure how the  race condition doesn't happen in  ProtobufRpcEngine.
I have to debug and spend some more time, considering that I have
reverted HADOOP-18082 for now to unblock YARN. Though the issue would still
be there as you said, but will give us some time to analyse.

Thanks
-Ayush

On Mon, 28 Feb 2022 at 21:26, Gyori Andras <ga...@cloudera.com.invalid>
wrote:

> Hey everyone!
>
> We have started seeing test failures in YARN PRs for a while. We have
> identified the problematic commit, which is HADOOP-18082
> <https://issues.apache.org/jira/browse/HADOOP-18082>, however, this change
> just revealed the race condition lying in ProtobufRpcEngine2 introduced in
> HADOOP-17046 <https://issues.apache.org/jira/browse/HADOOP-17046>. We have
> also fixed the underlying issue via a locking mechanism, presented in
> HADOOP-18143 <https://issues.apache.org/jira/browse/HADOOP-18143>, but
> since it is out of our area of expertise, we can neither verify nor
> guarantee that it will not cause some subtle issues in the RPC system.
> As we think it is a core part of Hadoop, we would use feedback from someone
> who is proficient in this part.
>
> Regards:
> Andras
>