Posted to dev@beam.apache.org by Valentyn Tymofieiev <va...@google.com> on 2021/09/08 23:56:11 UTC

Re: Upgrading vendored gRPC from 1.26.0 to 1.36.0

On Wed, May 19, 2021 at 7:21 PM Tomo Suzuki <su...@google.com> wrote:

> Update:
> I just merged Kiley's https://github.com/apache/beam/pull/14833. On that PR
> I ran "Run Java Precommit" several times and didn't observe the logging test
> (BeamFnLoggingServiceTest) failures. Let's see how the builds go.
>
I believe the error is still happening; I filed
https://issues.apache.org/jira/browse/BEAM-12859.


>
>
> Kenn, Ismaël, and Kiley,
> Thank you for the help and follow-up!
>
>
> On Thu, May 13, 2021 at 10:39 AM Tomo Suzuki <su...@google.com> wrote:
>
>> I'm giving up! Can anyone troubleshoot this gRPC concurrency problem
>> further?
>> My current view of the problem (link
>> <https://github.com/apache/beam/pull/14768#issuecomment-840576342>) is
>> that "grpc-default-executor" threads stop processing the data. But I cannot
>> tell why.
>>
>> I also raised a question with grpc-java on how best to troubleshoot such a
>> situation:
>> https://github.com/grpc/grpc-java/issues/8174
>>
>> On Wed, May 12, 2021 at 11:29 PM Tomo Suzuki <su...@google.com> wrote:
>>
>>> Update: the root cause is still unknown.
>>>
>>> From my observation with debug logging and thread dump,
>>> "grpc-default-executor-XXX" threads disappear when the problematic tests
>>> become hung.
>>> More notes:
>>> https://github.com/apache/beam/pull/14768#issuecomment-840228795
>>>
>>> Interestingly the "grpc-default-executor-XXX" threads reappear in the
>>> logs when the pause triggers a 5-second timeout set by JUnit.
>>>
>>>
>>> On Tue, May 11, 2021 at 1:12 PM Tomo Suzuki <su...@google.com> wrote:
>>>
>>>> Thank you for the advice. Yes, the latch not being counted-down is the
>>>> problem. (my memo:
>>>> https://github.com/apache/beam/pull/14474#discussion_r619557479 ) I'll
>>>> need to figure out why withOnError is not called.
>>>>
>>>>
>>>> > Can you repro locally?
>>>>
>>>> No, the task succeeds in my environment (./gradlew
>>>> :runners:google-cloud-dataflow-java:worker:test).
>>>>
>>>>
>>>> On Tue, May 11, 2021 at 12:34 PM Kenneth Knowles <ke...@apache.org>
>>>> wrote:
>>>>
>>>>> I am not sure how much you read the code of the test. So apologies if
>>>>> I am saying things you already know. The test does something like:
>>>>>
>>>>>  - start a logging service
>>>>>  - set up some stub clients, each with onError wired up to release a
>>>>> countdown latch
>>>>>  - send error responses to all three of them (actually it sends the
>>>>> error in the same task it creates the stub)
>>>>>  - each task waits on the latch
>>>>>
>>>>> So if onError is not delivered, or does not release the
>>>>> countdown latch, it will hang. I notice in the gist you provided that all
>>>>> three stub clients are hung awaiting the latch. That is suspicious to me. I
>>>>> would want to confirm if the flakiness always occurs in a way that hangs
>>>>> all three. Then there are gRPC workers waiting on empty queues, and the
>>>>> main test thread waiting for the hung tasks to complete.
>>>>>
>>>>> The problem could be something about the test set up. Personally I
>>>>> would add a ton of logs, or potentially use a debugger, to confirm exactly
>>>>> the state of things when it hangs. Can you repro locally? I think this same
>>>>> functionality could be tested in different ways that might remove some of
>>>>> the variables. For example, one could start up all the waiting tasks, then send
>>>>> all the onError messages that should cause them to terminate.
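
One way to read the "start all waiters, then send all errors" suggestion is
as a two-phase test. This is a rough sketch using only java.util.concurrent;
the class and method names are hypothetical, and the countDown() calls in
phase 2 merely stand in for the onError messages a real test would send.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class StartThenSignalSketch {
  /** Returns true if every waiting task terminated after the signals were sent. */
  static boolean run(int clients, long timeoutMillis) throws Exception {
    // One thread per client so that all waiters can be parked at once.
    ExecutorService pool = Executors.newFixedThreadPool(clients);
    CountDownLatch allWaiting = new CountDownLatch(clients);
    List<CountDownLatch> terminations = new ArrayList<>();
    List<Future<Boolean>> results = new ArrayList<>();
    try {
      // Phase 1: start every waiting task before any error is sent.
      for (int i = 0; i < clients; i++) {
        CountDownLatch terminated = new CountDownLatch(1);
        terminations.add(terminated);
        results.add(pool.submit(() -> {
          allWaiting.countDown();
          return terminated.await(timeoutMillis, TimeUnit.MILLISECONDS);
        }));
      }
      if (!allWaiting.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
        return false; // Some task never started.
      }
      // Phase 2: send all the "errors" (stand-in for the onError messages).
      for (CountDownLatch terminated : terminations) {
        terminated.countDown();
      }
      boolean allTerminated = true;
      for (Future<Boolean> result : results) {
        allTerminated &= result.get();
      }
      return allTerminated;
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(run(3, 1000));
  }
}
```

Separating the phases removes the interleaving of stub creation and error
sending as a variable, so a remaining hang would point at delivery itself.
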
>>>>>
>>>>> Since this is a unit test, adding a timeout to just that method should
>>>>> save time (but will make it harder to capture stack traces, etc). I've
>>>>> opened up https://github.com/apache/beam/pull/14781 for that. There
>>>>> may be a nice way to add a timeout to the executor to capture the hung
>>>>> stack, but I didn't look for it.
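
As for capturing the hung stack on timeout, one possible approach (a sketch
only, not something taken from the linked PR) is to wait on the Future with a
deadline and record Thread.getAllStackTraces() before cancelling anything.
The class name and the "sketch-worker" thread name are made up for
illustration.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class HungTaskDumpSketch {
  /** Runs the task; returns "" on completion, or a thread dump if it times out. */
  static String runWithDump(Callable<?> task, long timeoutMillis) throws Exception {
    ExecutorService pool =
        Executors.newSingleThreadExecutor(r -> new Thread(r, "sketch-worker"));
    Future<?> future = pool.submit(task);
    try {
      future.get(timeoutMillis, TimeUnit.MILLISECONDS);
      return "";
    } catch (TimeoutException e) {
      // Capture stacks while the task is still stuck, before it is interrupted.
      StringBuilder dump = new StringBuilder();
      for (Map.Entry<Thread, StackTraceElement[]> entry :
          Thread.getAllStackTraces().entrySet()) {
        dump.append(entry.getKey().getName()).append('\n');
        for (StackTraceElement frame : entry.getValue()) {
          dump.append("    at ").append(frame).append('\n');
        }
      }
      return dump.toString();
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    // A task that waits on a latch nobody releases, like the hung test tasks.
    String dump = runWithDump(() -> {
      new CountDownLatch(1).await();
      return null;
    }, 200);
    System.out.println(dump);
  }
}
```

The dump is taken before shutdownNow() interrupts the worker, so the stuck
frame (here, the await on the latch) is still visible in the output.
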
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Tue, May 11, 2021 at 7:36 AM Tomo Suzuki <su...@google.com>
>>>>> wrote:
>>>>>
>>>>>> gRPC 1.37.0 showed the same problem:
>>>>>> BeamFnLoggingServiceTest.testMultipleClientsFailingIsHandledGracefullyByServer
>>>>>> waits for its tasks forever, causing a timeout in Java precommit.
>>>>>>
>>>>>> While I continue my investigation, I would appreciate hearing from anyone
>>>>>> who knows the cause of the problem. I pasted the thread dump of the Java
>>>>>> process taken while the test was frozen:
>>>>>> https://github.com/apache/beam/pull/14768
>>>>>>
>>>>>> If this mystery is never solved, vendoring the (somewhat old) gRPC 1.32.2
>>>>>> without the jboss dependencies is an alternative option (suggested by Kenn;
>>>>>> memo
>>>>>> <https://issues.apache.org/jira/browse/BEAM-11227?focusedCommentId=17318238&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17318238>
>>>>>> )
>>>>>>
>>>>>> Regards,
>>>>>> Tomo
>>>>>>
>>>>>>
>>>>>> On Mon, May 10, 2021 at 9:40 AM Tomo Suzuki <su...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I was investigating the strange timeout (
>>>>>>> https://github.com/apache/beam/pull/14474) but was occupied with
>>>>>>> something else lately.
>>>>>>> Let me try the new version today to see whether it improves anything.
>>>>>>>
>>>>>>>
>>>>>>> On Mon, May 10, 2021 at 4:57 AM Ismaël Mejía <ie...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I just saw that gRPC 1.37.1 is out now (and with aarch64 support
>>>>>>>> for python!), which made me wonder about this. Tomo, what is the current
>>>>>>>> status of upgrading the vendored dependency?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Apr 8, 2021 at 4:16 PM Tomo Suzuki <su...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> We observed that the cron job of Java Precommit for the master branch
>>>>>>>>> started timing out often (not always) after the gRPC version upgrade.
>>>>>>>>> https://github.com/apache/beam/pull/14466#issuecomment-815343974
>>>>>>>>>
>>>>>>>>> After exchanging messages with Kenn, I reverted the change; the
>>>>>>>>> master branch now uses the vendored gRPC 1.26.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Mar 31, 2021 at 11:40 AM Kenneth Knowles <ke...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Merged. Let's keep an eye out for trouble, and I will incorporate it into
>>>>>>>>>> the release branch.
>>>>>>>>>>
>>>>>>>>>> Kenn
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 31, 2021 at 6:45 AM Tomo Suzuki <su...@google.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Regarding troubleshooting the build timeouts, it seems that the Docker
>>>>>>>>>>> cache on the Jenkins machines might be playing a role. As I run more "Java
>>>>>>>>>>> Presubmit", I no longer observe timeouts in the PR.
>>>>>>>>>>>
>>>>>>>>>>> Kenn, would you merge the PR?
>>>>>>>>>>> https://github.com/apache/beam/pull/14295 (all checks green,
>>>>>>>>>>> including the new Java postcommit checks)
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Mar 25, 2021 at 5:24 PM Kenneth Knowles <ke...@apache.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yes, I agree this might be a good idea. This is not the only
>>>>>>>>>>>> major issue on the release-2.29.0 branch.
>>>>>>>>>>>>
>>>>>>>>>>>> The counter argument is that we will be pulling in all the bugs
>>>>>>>>>>>> introduced to `master` since the branch cut.
>>>>>>>>>>>>
>>>>>>>>>>>> As far as effort goes, I have been mostly focused on burning
>>>>>>>>>>>> down the bugs so I would not lose much work in the release process.
>>>>>>>>>>>>
>>>>>>>>>>>> Kenn
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Mar 25, 2021 at 1:42 PM Ismaël Mejía <ie...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Precommit has been quite unstable in the last few days, so it is worth
>>>>>>>>>>>>> checking whether something is wrong in the CI.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have a question, Kenn. Given that cherry-picking this might be a rather
>>>>>>>>>>>>> big change, can we reconsider and cut the 2.29.0 branch again after the
>>>>>>>>>>>>> updated gRPC version gets merged, then re-mark the issues already fixed
>>>>>>>>>>>>> for version 2.30.0 as fixed in version 2.29.0? That seems like an easier
>>>>>>>>>>>>> upgrade path (and we would get some nice fixes/improvements, like
>>>>>>>>>>>>> official Spark 3 support, for free in the release).
>>>>>>>>>>>>>
>>>>>>>>>>>>> WDYT?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Mar 24, 2021 at 8:06 PM Tomo Suzuki <
>>>>>>>>>>>>> suztomo@google.com> wrote:
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Update: I observe that the Java precommit check is unstable in
>>>>>>>>>>>>> the PR to upgrade vendored gRPC (compared with a PR with an empty change).
>>>>>>>>>>>>> There are no consistent failures; sometimes it succeeds and other times it
>>>>>>>>>>>>> hits timeouts and flaky test failures.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >
>>>>>>>>>>>>> https://github.com/apache/beam/pull/14295#issuecomment-806071087
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > On Mon, Mar 22, 2021 at 10:46 AM Tomo Suzuki <
>>>>>>>>>>>>> suztomo@google.com> wrote:
>>>>>>>>>>>>> >>
>>>>>>>>>>>>> >> Thank you for voting. I see the artifact is available
>>>>>>>>>>>>> in Maven Central. I'll work on the PR to use the published artifact today.
>>>>>>>>>>>>> >>
>>>>>>>>>>>>> https://search.maven.org/artifact/org.apache.beam/beam-vendor-grpc-1_36_0/0.1/jar
>>>>>>>>>>>>> >>
>>>>>>>>>>>>> >> On Tue, Mar 16, 2021 at 3:07 PM Kenneth Knowles <
>>>>>>>>>>>>> kenn@apache.org> wrote:
>>>>>>>>>>>>> >>>
>>>>>>>>>>>>> >>> Update on this: there are some minor issues and then I'll
>>>>>>>>>>>>> send out the RC.
>>>>>>>>>>>>> >>>
>>>>>>>>>>>>> >>> I think this is worth blocking 2.29.0 release on, so I
>>>>>>>>>>>>> will do this first. We are still eliminating other blockers from 2.29.0
>>>>>>>>>>>>> anyhow.
>>>>>>>>>>>>> >>>
>>>>>>>>>>>>> >>> Kenn
>>>>>>>>>>>>> >>>
>>>>>>>>>>>>> >>> On Mon, Mar 15, 2021 at 7:17 AM Tomo Suzuki <
>>>>>>>>>>>>> suztomo@google.com> wrote:
>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>> >>>> Hi Beam developers,
>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>> >>>> I'm working on upgrading the vendored gRPC to 1.36.0:
>>>>>>>>>>>>> >>>> https://issues.apache.org/jira/browse/BEAM-11227 (PR:
>>>>>>>>>>>>> https://github.com/apache/beam/pull/14028)
>>>>>>>>>>>>> >>>> Let me know if you have any questions or concerns.
>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>> >>>> Background:
>>>>>>>>>>>>> >>>> After exchanging messages with Ismaël in BEAM-11227, it seems
>>>>>>>>>>>>> that the ticket, created by some automation, is a false positive, but it
>>>>>>>>>>>>> would be nice to use an artifact that is not flagged with a CVE.
>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>> >>>> Kenn offered to work as the release manager (as in
>>>>>>>>>>>>> https://s.apache.org/beam-release-vendored-artifacts) of the
>>>>>>>>>>>>> vendored artifact.
>>>>>>>>>>>>> >>>>
>>>>>>>>>>>>> >>>> --
>>>>>>>>>>>>> >>>> Regards,
>>>>>>>>>>>>> >>>> Tomo