Posted to dev@beam.apache.org by Daniel Oliveira <da...@google.com> on 2021/09/22 20:03:23 UTC

Best practices for upgrading installed dependencies on Jenkins VMs?

Hey everyone,

I'm planning to upgrade the version of Go on our Jenkins VMs, and I found these
instructions on upgrading software on Jenkins
<https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers>
on our cwiki.

I haven't started going through it yet, but I was wondering about the last
few steps that involve stopping VMs, deleting boot disks, and restarting
executors. Is there some best practice for that section to avoid causing
interruptions in our automated testing? Should I be trying to do this
outside of peak dev hours, or going one VM at a time so others can pick up
extra load, or anything like that?

Thanks,
Daniel Oliveira

Re: Best practices for upgrading installed dependencies on Jenkins VMs?

Posted by Valentyn Tymofieiev <va...@google.com>.
Sounds good, thanks.

On Thu, Jan 6, 2022 at 11:13 AM Daniela Martín <da...@wizeline.com>
wrote:

> Hi Valentyn,
>
> We decided to include the Java 17 installation in the image that we are
> creating for the Ubuntu upgrade (BEAM-12621). We are using the latest image
> *jenkins-worker-boot-image-20211029* that the Jenkins workers are
> currently using, so the remaining changes in this new image would be the
> ones that were made yesterday in the *jenkins-worker-boot-image-20220105*
> image.
>
> We will create the new image later today, including the Ubuntu upgrade and
> Java SDK 17 installation (which were previously implemented in
> *jenkins-worker-boot-image-20211214*), and let you know.
>
> Thank you.
>
> Regards,
>
> On Thu, Jan 6, 2022 at 10:01 AM Valentyn Tymofieiev <va...@google.com>
> wrote:
>
>> Thanks, Daniela. I am happy to spot-check the new image you are building
>> for issues I am aware of.
>>
>> I made my changes to the latest VM image, building on top of latest
>> jenkins-worker-boot-image-20211214, and replicated those changes on the
>> running workers.
>>
>> I noticed that current Jenkins workers (at least some of them) are still
>> running on boot disks from older image jenkins-worker-boot-image-20211029,
>> and not the newest available image, jenkins-worker-boot-image-20211214.
>> Image comment for the latter image says: Installed Java SDK 17. See
>> BEAM-12313.
>>
>> I was wondering - is there a reason we did not reload Jenkins workers to
>> pick up this latest image? Or did you decide to upgrade to the new Ubuntu
>> version instead, which would also include Java 17?
>>
>> If jenkins-worker-boot-image-20211214 is known to work and needed for
>> BEAM-12313 ~now, I can do this update, and we can continue to work in
>> parallel on BEAM-12621.
>>
>> Thanks,
>> Valentyn

Re: Best practices for upgrading installed dependencies on Jenkins VMs?

Posted by Daniela Martín <da...@wizeline.com>.
Hi Valentyn,

We decided to include the Java 17 installation in the image that we are
creating for the Ubuntu upgrade (BEAM-12621). We are using the latest image
*jenkins-worker-boot-image-20211029* that the Jenkins workers are currently
using, so the remaining changes in this new image would be the ones that
were made yesterday in the *jenkins-worker-boot-image-20220105* image.

We will create the new image later today, including the Ubuntu upgrade and
Java SDK 17 installation (which were previously implemented in
*jenkins-worker-boot-image-20211214*), and let you know.

Thank you.

Regards,

On Thu, Jan 6, 2022 at 10:01 AM Valentyn Tymofieiev <va...@google.com>
wrote:

> Thanks, Daniela. I am happy to spot-check the new image you are building
> for issues I am aware of.
>
> I made my changes to the latest VM image, building on top of latest
> jenkins-worker-boot-image-20211214, and replicated those changes on the
> running workers.
>
> I noticed that current Jenkins workers (at least some of them) are still
> running on boot disks from older image jenkins-worker-boot-image-20211029,
> and not the newest available image, jenkins-worker-boot-image-20211214.
> Image comment for the latter image says: Installed Java SDK 17. See
> BEAM-12313.
>
> I was wondering - is there a reason we did not reload Jenkins workers to
> pick up this latest image? Or did you decide to upgrade to the new Ubuntu
> version instead, which would also include Java 17?
>
> If jenkins-worker-boot-image-20211214 is known to work and needed for
> BEAM-12313 ~now, I can do this update, and we can continue to work in
> parallel on BEAM-12621.
>
> Thanks,
> Valentyn
>


-- 

Daniela Martín (She/Her) | <https://www.wizeline.com/>

Site Reliability Engineer

daniela.martin@wizeline.com

Amado Nervo 2200, Esfera P6, Col. Ciudad del Sol, 45050 Zapopan, Jal.

Follow us Twitter <https://twitter.com/wizelineglobal> | Facebook
<https://www.facebook.com/WizelineGlobal> | Instagram
<https://www.instagram.com/wizelineglobal/> | LinkedIn
<https://www.linkedin.com/company/wizeline>

Share feedback on Clutch <https://clutch.co/review/submit/375119>

-- 
*This email and its contents (including any attachments) are being sent to
you on the condition of confidentiality and may be protected by legal
privilege. Access to this email by anyone other than the intended recipient
is unauthorized. If you are not the intended recipient, please immediately
notify the sender by replying to this message and delete the material
immediately from your system. Any further use, dissemination, distribution
or reproduction of this email is strictly prohibited. Further, no
representation is made with respect to any content contained in this email.*

Re: Best practices for upgrading installed dependencies on Jenkins VMs?

Posted by Valentyn Tymofieiev <va...@google.com>.
Thanks, Daniela. I am happy to spot-check the new image you are building
for issues I am aware of.

I made my changes to the latest VM image, building on top of latest
jenkins-worker-boot-image-20211214, and replicated those changes on the
running workers.

I noticed that current Jenkins workers (at least some of them) are still
running on boot disks from older image jenkins-worker-boot-image-20211029,
and not the newest available image, jenkins-worker-boot-image-20211214.
Image comment for the latter image says: Installed Java SDK 17. See
BEAM-12313.

I was wondering - is there a reason we did not reload Jenkins workers to
pick up this latest image? Or did you decide to upgrade to the new Ubuntu
version instead, which would also include Java 17?

If jenkins-worker-boot-image-20211214 is known to work and needed for
BEAM-12313 ~now, I can do this update, and we can continue to work in
parallel on BEAM-12621.

Thanks,
Valentyn
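
The check described in the message above (which workers still boot from an
older image than the newest one available) can be sketched in a few lines.
This is only an illustration, assuming image names end in a YYYYMMDD stamp as
they do in this thread. The worker names and the worker-to-image mapping are
hypothetical; in practice they would have to be read from the Cloud Console or
similar.

```python
import re


def image_date(image_name):
    """Return the trailing YYYYMMDD stamp of an image name as an int."""
    match = re.search(r"-(\d{8})$", image_name)
    if match is None:
        raise ValueError(f"no date stamp in image name: {image_name!r}")
    return int(match.group(1))


def stale_workers(worker_images, available_images):
    """List workers whose boot image is older than the newest available one."""
    newest = max(image_date(name) for name in available_images)
    return sorted(
        worker
        for worker, image in worker_images.items()
        if image_date(image) < newest
    )


# Hypothetical inventory, for illustration only.
workers = {
    "apache-ci-beam-jenkins-1": "jenkins-worker-boot-image-20211029",
    "apache-ci-beam-jenkins-2": "jenkins-worker-boot-image-20211214",
}
images = [
    "jenkins-worker-boot-image-20211029",
    "jenkins-worker-boot-image-20211214",
]
print(stale_workers(workers, images))
```

Comparing the date stamps as integers avoids depending on the order in which
images happen to be listed.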

Re: Best practices for upgrading installed dependencies on Jenkins VMs?

Posted by Robert Burke <ro...@frantil.com>.
Ack. Thanks for the heads-up.

As long as Jenkins ends up with at least go1.16, the Go targets should be
fine.
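
A minimal sketch of how the "at least go1.16" requirement could be checked on
a worker, assuming the usual `go version` output format (for example
"go version go1.17.5 linux/amd64"). The function names are hypothetical and
this is not part of any existing Beam tooling.

```python
import re

MIN_GO = (1, 16)  # minimum version mentioned in the message above


def parse_go_version(output):
    """Extract (major, minor) from the text printed by `go version`."""
    match = re.search(r"\bgo(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"unrecognized `go version` output: {output!r}")
    return int(match.group(1)), int(match.group(2))


def go_is_new_enough(output, minimum=MIN_GO):
    """Compare versions as integer tuples, so that 1.9 sorts below 1.16."""
    return parse_go_version(output) >= minimum
```

The output string would come from running `go version` as the same user the
Jenkins agent runs as, since earlier in this thread the command worked in an
SSH session but was not found by the agents.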

On Wed, Jan 5, 2022, 6:08 PM Daniela Martín <da...@wizeline.com>
wrote:

> Hi Valentyn,
>
> Giomar and I are working on the upgrade of Jenkins VMs to a modern Ubuntu
> version (BEAM-12621 <https://issues.apache.org/jira/browse/BEAM-12621>).
> We are very close to finishing it, and we will reach out for the review.
>
> Thank you.
>
> Regards,
>
> On Wed, Jan 5, 2022 at 7:57 PM Valentyn Tymofieiev <va...@google.com>
> wrote:
>
>> Heads-up, I am planning a Jenkins image upgrade with a minor change to
>> clean up some unwanted log4j artifacts from gradle caches to silence some
>> alerts I received. Hopefully no one else is currently doing an upgrade,
>> otherwise - please reach out.
>>
>> I will take care of the PATH issue discussed here (if it is still an
>> issue).
>>
>>
>> On Tue, Nov 2, 2021 at 4:16 PM Robert Burke <ro...@frantil.com> wrote:
>>
>>> TIL as well. Sounds like the right location. Thanks Valentyn!
>>>
>>> On Tue, Nov 2, 2021, 11:00 AM Valentyn Tymofieiev <va...@google.com>
>>> wrote:
>>>
>>>> Yeah, .profile is only sourced by login shells. Adding the PATH in
>>>> .bashrc can be a workaround, but since .bashrc is executed every time a new
>>>> shell runs, the PATH variable will grow with every shell subprocess, so
>>>> several sources recommend .profile instead, which does not always work.
>>>> We should be able to fix this by updating /etc/environment instead
>>>> (TIL).
>>>>
>>>> This is the current content:
>>>> cat /etc/environment
>>>>
>>>> PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games"
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Nov 1, 2021 at 10:50 AM Robert Burke <ro...@frantil.com>
>>>> wrote:
>>>>
>>>>> Looks like while .profile was edited to add in a PATH section pointing
>>>>> to /snap/bin (where go is now installed), it doesn't seem like .profile is
>>>>> executed by the jenkins login shells.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Oct 29, 2021, 6:23 PM Valentyn Tymofieiev <va...@google.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Oct 20, 2021 at 11:16 AM Valentyn Tymofieiev <
>>>>>> valentyn@google.com> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Oct 20, 2021 at 11:12 AM Pablo Estrada <pa...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks everyone for investigating and documenting this. I'll use it
>>>>>>>> today : )
>>>>>>>>
>>>>>>> Dan may be also in the middle of doing this, please coordinate.
>>>>>>>
>>>>>>>>
>>>>>>>> ahem - maybe we should rename the image name/image family names
>>>>>>>> to jenkins-worker-boot-image ? Does anyone foresee issues if we do that?
>>>>>>>> Does jenkins depend on these names in some undocumented way?
>>>>>>>>
>>>>>>> +1. it should 'just work', need to update the wiki after the change.
>>>>>>> Jenkins also did a terminology adjustment.
>>>>>>>
>>>>>> I had to reimage Jenkins workers again, took care of the rename and
>>>>>> changed the instructions.
>>>>>>
>>>>>> I am not sure what the status of the Go Postcommit problem is, but I
>>>>>> noticed that jenkins worker #1 had a different boot disk. I reimaged all
>>>>>> workers building on top of the latest image from the image family. If Go
>>>>>> tests start failing, we may need to get help from Dan again.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>> On Tue, Oct 19, 2021 at 1:43 PM Daniel Oliveira <
>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>
>>>>>>>>> I'm ok with deciding to avoid the "lite" update option, feel free
>>>>>>>>> to revise the instructions as seems appropriate. As for the issue, I
>>>>>>>>> fixed it with a workaround that should work until we need to add a new
>>>>>>>>> image to the agents, and I'm currently investigating the root cause and
>>>>>>>>> preparing a fixed image.
>>>>>>>>>
>>>>>>>>> That said, I think this issue would have still happened even if we
>>>>>>>>> didn't perform the "lite" update. I'm still trying to figure out the exact
>>>>>>>>> problem, but it looks to be a PATH issue that wasn't effectively caught by
>>>>>>>>> the current process. I won't get into details too much in this thread (see
>>>>>>>>> the Jira for that), but essentially everything works in my environment when
>>>>>>>>> I SSH into the VMs, but because the location of the "go" command changed in
>>>>>>>>> the PATH, it seems to have stopped working for every other user, including
>>>>>>>>> the Jenkins agents. I actually did notice that would happen when I was
>>>>>>>>> working on the image, but the solution seemed to be to reboot the machine,
>>>>>>>>> which I assumed happened already since I shut down the VM to image it.
>>>>>>>>>
>>>>>>>>> On Tue, Oct 19, 2021 at 12:09 PM Robert Burke <ro...@frantil.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> +1 to only having one way to do things. The Lite option seems
>>>>>>>>>> liable to cause more problems since it means its changes can be blown away
>>>>>>>>>> if a new image isn't prepared anyway.
>>>>>>>>>> I don't think we are changing the images often enough for it.
>>>>>>>>>> Perhaps call it the option to test changes if anything?
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 19, 2021, 11:55 AM Valentyn Tymofieiev <
>>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> All workers were updated to
>>>>>>>>>>> use jenkins-slave-boot-image-20211011, which should have had a go command,
>>>>>>>>>>> but it appears slightly misconfigured. I reopened BEAM-13037 [1] and added
>>>>>>>>>>> some details there.
>>>>>>>>>>>
>>>>>>>>>>> I also added instructions to wiki [2] on how to perform an image
>>>>>>>>>>> swap and it is actually very straightforward. I think a lesson here is that
>>>>>>>>>>> making 'lite' upgrades is brittle as misconfigurations could resurface down
>>>>>>>>>>> the road when the context of the lite upgrade is no longer fresh in our
>>>>>>>>>>> memory.
>>>>>>>>>>>
>>>>>>>>>>> I suggest we revise the instructions to keep only image swap
>>>>>>>>>>> commands and remove the 'lite' update option. +Daniel Oliveira
>>>>>>>>>>> <da...@google.com>, WDYT?  In the meantime, we should
>>>>>>>>>>> also prepare an image that fixes the misconfiguration. Would you be able to
>>>>>>>>>>> help with that? Thank you.
>>>>>>>>>>>
>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-13037
>>>>>>>>>>> [2]
>>>>>>>>>>> https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Oct 19, 2021 at 8:46 AM Robert Burke <ro...@frantil.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> FYI it looks like all the Go tests are now failing because it
>>>>>>>>>>>> can't find the Go command at all.
>>>>>>>>>>>> Did a Jenkins image without Go (v1.16+) pre-installed get
>>>>>>>>>>>> pushed?
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 18, 2021, 1:45 PM Valentyn Tymofieiev <
>>>>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks Daniel,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I can recreate the VMs on new disks.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We currently have a set of stopped jenkins workers (named:
>>>>>>>>>>>>> apache-beam-jenkins-##) and running workers (named:
>>>>>>>>>>>>> apache-ci-beam-jenkins-##)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Are there any concerns about deleting the stopped group of
>>>>>>>>>>>>> workers?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Oct 18, 2021 at 11:19 AM Ahmet Altay <al...@google.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you Daniel, Valentyn!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Oct 18, 2021 at 8:02 AM Daniel Oliveira <
>>>>>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I performed a light update of both Go and Python (from
>>>>>>>>>>>>>>> Valentyn's update) on each worker VM over the weekend. I also added
>>>>>>>>>>>>>>> additional instructions for the light update to Confluence (as an
>>>>>>>>>>>>>>> alternative to the current instructions).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There is still reason to perform a full update at some
>>>>>>>>>>>>>>> point: Valentyn updated the VM image from 500 GB to 1000 GB of storage,
>>>>>>>>>>>>>>> which requires a full update to actually take effect.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Oct 12, 2021 at 10:32 AM Valentyn Tymofieiev <
>>>>>>>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > 3. SSH into the agent and perform the update.
>>>>>>>>>>>>>>>> So, this would be a 'lite' version of the update, where we
>>>>>>>>>>>>>>>> make changes to the live worker without recreating worker VM with a new
>>>>>>>>>>>>>>>> image? We could perhaps document both options, and also make it clear that
>>>>>>>>>>>>>>>> producing a VM image that has necessary updates is mandatory even if we
>>>>>>>>>>>>>>>> perform 'lite' updates without recreating the worker.
>>>>>>>>>>>>>>>> Also, for a lite update, marking the Jenkins worker offline
>>>>>>>>>>>>>>>> may be optional, as some updates might not be disruptive (such as
>>>>>>>>>>>>>>>> installing some software that will not be used immediately).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Oct 11, 2021 at 7:53 PM Robert Burke <
>>>>>>>>>>>>>>>> robert@frantil.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> SGTM. Thank you very much Daniel!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Oct 11, 2021, 7:51 PM Ahmet Altay <
>>>>>>>>>>>>>>>>> altay@google.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thank you Daniel. Could you please update the wiki once
>>>>>>>>>>>>>>>>>> you are done with the process?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, Oct 11, 2021 at 6:22 PM Daniel Oliveira <
>>>>>>>>>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Took me a bit to get to this, sorry. I finally figured
>>>>>>>>>>>>>>>>>>> out an approach for updating Go and did so and will be updating the image
>>>>>>>>>>>>>>>>>>> momentarily.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I think a more important note is that I tried what
>>>>>>>>>>>>>>>>>>> Valentyn was considering, which is SSHing into workers and updating the
>>>>>>>>>>>>>>>>>>> dependency. I'll describe the process below, but the summary is that I did
>>>>>>>>>>>>>>>>>>> it on one worker with Go so far, saw no problems over the weekend, and
>>>>>>>>>>>>>>>>>>> would like to continue updating the rest of the workers if there are no
>>>>>>>>>>>>>>>>>>> objections.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Here's a step-by-step of what I did. If we decide to
>>>>>>>>>>>>>>>>>>> stick with this approach, these instructions can be added to Confluence:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 1. Go to the page for the Jenkins agent you want to
>>>>>>>>>>>>>>>>>>> update [1] and click "Mark this node temporarily offline", leaving a reason
>>>>>>>>>>>>>>>>>>> such as "Updating X dependency."
>>>>>>>>>>>>>>>>>>> 2. Wait until there are no more tests running in that
>>>>>>>>>>>>>>>>>>> agent (under "Build Executor Status" on the left of the page).
>>>>>>>>>>>>>>>>>>> 3. SSH into the agent and perform the update.
>>>>>>>>>>>>>>>>>>> 4. Mark the node as online again.
>>>>>>>>>>>>>>>>>>> 5. Repeat for every worker.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> And these are some additional steps if you want to
>>>>>>>>>>>>>>>>>>> immediately run a test suite to check that the update worked correctly. For
>>>>>>>>>>>>>>>>>>> example in my case, I wanted to check against the Go Postcommit, and it was
>>>>>>>>>>>>>>>>>>> a good thing I did, because it actually failed the first time and I had to
>>>>>>>>>>>>>>>>>>> go back in to fix a small oversight I made. So doing this after you update
>>>>>>>>>>>>>>>>>>> your first worker is probably a good idea before updating the rest:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 1. Go to the page for the job you want to run (for
>>>>>>>>>>>>>>>>>>> example: [2]).
>>>>>>>>>>>>>>>>>>> 2. Click "Configure" on the left menu.
>>>>>>>>>>>>>>>>>>> 3. Find the checkmark "Restrict where this project can
>>>>>>>>>>>>>>>>>>> be run" and change the restriction from "beam" to the specific name of the
>>>>>>>>>>>>>>>>>>> agent (ex. "apache-beam-jenkins-1").
>>>>>>>>>>>>>>>>>>> 4. Save and apply that change.
>>>>>>>>>>>>>>>>>>> 5. Back on the page for the job, click "Build with
>>>>>>>>>>>>>>>>>>> Parameters" on the left menu.
>>>>>>>>>>>>>>>>>>> 6. Run the build on "master".
>>>>>>>>>>>>>>>>>>> 7. Once you're done checking the results, change
>>>>>>>>>>>>>>>>>>> the restriction for the job back to "beam". (This also gets reset once
>>>>>>>>>>>>>>>>>>> every 24 hours in case you forget.)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I did that on one agent (apache-beam-jenkins-2) on
>>>>>>>>>>>>>>>>>>> Friday evening when it wasn't too busy, and got Go updated and working. I
>>>>>>>>>>>>>>>>>>> checked that agent's execution history again today just in case, and it was
>>>>>>>>>>>>>>>>>>> healthy over the weekend, with no Go-related problems as far as I could
>>>>>>>>>>>>>>>>>>> see. If there's no objections I'd like to go ahead and continue updating
>>>>>>>>>>>>>>>>>>> the rest of the workers (I'll do this late at night or over the weekend to
>>>>>>>>>>>>>>>>>>> avoid disrupting dev work).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>> https://ci-beam.apache.org/computer/apache-beam-jenkins-1/
>>>>>>>>>>>>>>>>>>> [2] https://ci-beam.apache.org/job/beam_PostCommit_Go/
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Mon, Oct 4, 2021 at 6:14 PM Valentyn Tymofieiev <
>>>>>>>>>>>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I updated the image in [1], but did not change the
>>>>>>>>>>>>>>>>>>>> workers to pick up the new image yet. We can do this once we add Go
>>>>>>>>>>>>>>>>>>>> changes on top of it.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I am also considering SSHing into every worker and running
>>>>>>>>>>>>>>>>>>>> a one-line command that adds the dependency that was missing. It seems to
>>>>>>>>>>>>>>>>>>>> be low risk, and there is a fall-back plan to re-start the worker using
>>>>>>>>>>>>>>>>>>>> the saved image - both new and old images are saved and available in Cloud
>>>>>>>>>>>>>>>>>>>> Console.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Ideally, we should find a way to do a rolling upgrade
>>>>>>>>>>>>>>>>>>>> that a PMC or committer could trigger without logging into every machine.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424228#comment-17424228
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 3:28 PM Daniel Oliveira <
>>>>>>>>>>>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> @Brian Hulette <bh...@google.com> That button
>>>>>>>>>>>>>>>>>>>>> seems like exactly what we'd need. Doing it manually would be a pain, but
>>>>>>>>>>>>>>>>>>>>> it's probably still preferable to causing a bunch of aborted tests.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> @Valentyn Tymofieiev <va...@google.com> Collaborating
>>>>>>>>>>>>>>>>>>>>> to do both updates at once is a great idea! I'll message you directly about
>>>>>>>>>>>>>>>>>>>>> it.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:44 PM Valentyn Tymofieiev <
>>>>>>>>>>>>>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I am also interested in updating the version of
>>>>>>>>>>>>>>>>>>>>>> Python on the VMs; I need to install Python 3.9. Thanks for looking into this.
>>>>>>>>>>>>>>>>>>>>>> We can coordinate together to make one update instead of two.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:40 PM Brian Hulette <
>>>>>>>>>>>>>>>>>>>>>> bhulette@google.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I'm not sure about best practices here. Out of
>>>>>>>>>>>>>>>>>>>>>>> curiosity I just poked around in the Jenkins UI (e.g. [1]) and it looks
>>>>>>>>>>>>>>>>>>>>>>> like you can manually "Mark node temporarily offline" when logged in (if
>>>>>>>>>>>>>>>>>>>>>>> you're a committer). According to [2] this will prevent it from picking up
>>>>>>>>>>>>>>>>>>>>>>> new jobs after it's finished the currently executing ones. Doing that
>>>>>>>>>>>>>>>>>>>>>>> manually for every worker could be a pain though.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>>>> https://ci-beam.apache.org/computer/apache-beam-jenkins-13/
>>>>>>>>>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>>>>>>>>>> https://stackoverflow.com/questions/26553612/how-do-i-disable-a-node-in-jenkins-ui-after-it-has-completed-its-currently-runni
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 1:03 PM Daniel Oliveira <
>>>>>>>>>>>>>>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I'm planning to upgrade the version of Go on our
>>>>>>>>>>>>>>>>>>>>>>>> Jenkins VMs, and I found these instructions on
>>>>>>>>>>>>>>>>>>>>>>>> upgrading software on Jenkins
>>>>>>>>>>>>>>>>>>>>>>>> <https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers> on
>>>>>>>>>>>>>>>>>>>>>>>> our cwiki.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I haven't started going through it yet, but I was
>>>>>>>>>>>>>>>>>>>>>>>> wondering about the last few steps that involve stopping VMs, deleting boot
>>>>>>>>>>>>>>>>>>>>>>>> disks, and restarting executors. Is there some best practice for
>>>>>>>>>>>>>>>>>>>>>>>> that section to avoid causing interruptions in our automated testing?
>>>>>>>>>>>>>>>>>>>>>>>> Should I be trying to do this outside of peak dev hours, or going one VM at
>>>>>>>>>>>>>>>>>>>>>>>> a time so others can pick up extra load, or anything like that?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>> Daniel Oliveira
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>

Re: Best practices for upgrading installed dependencies on Jenkins VMs?

Posted by Daniela Martín <da...@wizeline.com>.
Hi Valentyn,

Giomar and I are working on the upgrade of Jenkins VMs to a modern Ubuntu
version (BEAM-12621 <https://issues.apache.org/jira/browse/BEAM-12621>). We
are very close to finishing it, and we will reach out for the review.

Thank you.

Regards,

On Wed, Jan 5, 2022 at 7:57 PM Valentyn Tymofieiev <va...@google.com>
wrote:

> Heads-up, I am planning a Jenkins image upgrade with a minor change to
> clean up some unwanted log4j artifacts from gradle caches to silence some
> alerts I received. Hopefully no one else is currently doing an upgrade,
> otherwise - please reach out.
>
> I will take care of the PATH issue discussed here (if it is still an
> issue).
>
>
> On Tue, Nov 2, 2021 at 4:16 PM Robert Burke <ro...@frantil.com> wrote:
>
>> TIL as well. Sounds like the right location. Thanks Valentyn!
>>
>> On Tue, Nov 2, 2021, 11:00 AM Valentyn Tymofieiev <va...@google.com>
>> wrote:
>>
>>> Yeah, .profile is only sourced by login shells. Adding the PATH in
>>> .bashrc can be a workaround, but since .bashrc is executed every time a new
>>> shell runs, the PATH variable will grow with every shell subprocess, so
>>> several sources recommend .profile instead, which does not always work.
>>> We should be able to fix this by updating /etc/environment instead
>>> (TIL).
>>>
>>> This is the current content:
>>> cat /etc/environment
>>>
>>> PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games"
>>>
>>>
>>>
>>>
>>> On Mon, Nov 1, 2021 at 10:50 AM Robert Burke <ro...@frantil.com> wrote:
>>>
>>>> Looks like while .profile was edited to add in a PATH section pointing
>>>> to /snap/bin (where go is now installed), it doesn't seem like .profile is
>>>> executed by the jenkins login shells.
>>>>
>>>>
>>>>
>>>> On Fri, Oct 29, 2021, 6:23 PM Valentyn Tymofieiev <va...@google.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Wed, Oct 20, 2021 at 11:16 AM Valentyn Tymofieiev <
>>>>> valentyn@google.com> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Oct 20, 2021 at 11:12 AM Pablo Estrada <pa...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks everyone for investigating and documenting this. I'll use it
>>>>>>> today : )
>>>>>>>
>>>>>> Dan may be also in the middle of doing this, please coordinate.
>>>>>>
>>>>>>>
>>>>>>> ahem - maybe we should rename the image name/image family names
>>>>>>> to jenkins-worker-boot-image ? Does anyone foresee issues if we do that?
>>>>>>> Does jenkins depend on these names in some undocumented way?
>>>>>>>
>>>>>> +1. it should 'just work', need to update the wiki after the change.
>>>>>> Jenkins also did a terminology adjustment.
>>>>>>
>>>>> I had to reimage Jenkins workers again, took care of the rename and
>>>>> changed the instructions.
>>>>>
>>>>> I am not sure what the status of the Go Postcommit problem is, but I
>>>>> noticed that jenkins worker #1 had a different boot disk. I reimaged all
>>>>> workers building on top of the latest image from the image family. If Go
>>>>> tests start failing, we may need to get help from Dan again.
>>>>>
>>>>>
>>>>>>
>>>>>>> On Tue, Oct 19, 2021 at 1:43 PM Daniel Oliveira <
>>>>>>> danoliveira@google.com> wrote:
>>>>>>>
>>>>>>>> I'm ok with deciding to avoid the "lite" update option, feel free
>>>>>>>> to revise the instructions as seems appropriate. As for the issue, I
>>>>>>>> fixed it with a workaround that should work until we need to add a new
>>>>>>>> image to the agents, and I'm currently investigating the root cause and
>>>>>>>> preparing a fixed image.
>>>>>>>>
>>>>>>>> That said, I think this issue would have still happened even if we
>>>>>>>> didn't perform the "lite" update. I'm still trying to figure out the exact
>>>>>>>> problem, but it looks to be a PATH issue that wasn't effectively caught by
>>>>>>>> the current process. I won't get into details too much in this thread (see
>>>>>>>> the Jira for that), but essentially everything works in my environment when
>>>>>>>> I SSH into the VMs, but because the location of the "go" command changed in
>>>>>>>> the PATH, it seems to have stopped working for every other user, including
>>>>>>>> the Jenkins agents. I actually did notice that would happen when I was
>>>>>>>> working on the image, but the solution seemed to be to reboot the machine,
>>>>>>>> which I assumed happened already since I shut down the VM to image it.
>>>>>>>>
>>>>>>>> On Tue, Oct 19, 2021 at 12:09 PM Robert Burke <ro...@frantil.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> +1 to only having one way to do things. The Lite option seems
>>>>>>>>> liable to cause more problems since it means its changes can be blown away
>>>>>>>>> if a new image isn't prepared anyway.
>>>>>>>>> I don't think we are changing the images often enough for it.
>>>>>>>>> Perhaps call it the option to test changes if anything?
>>>>>>>>>
>>>>>>>>> On Tue, Oct 19, 2021, 11:55 AM Valentyn Tymofieiev <
>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> All workers were updated to
>>>>>>>>>> use jenkins-slave-boot-image-20211011, which should have had a go command,
>>>>>>>>>> but it appears slightly misconfigured. I reopened BEAM-13037 [1] and added
>>>>>>>>>> some details there.
>>>>>>>>>>
>>>>>>>>>> I also added instructions to wiki [2] on how to perform an image
>>>>>>>>>> swap and it is actually very straightforward. I think a lesson here is that
>>>>>>>>>> making 'lite' upgrades is brittle as misconfigurations could resurface down
>>>>>>>>>> the road when the context of the lite upgrade is no longer fresh in our
>>>>>>>>>> memory.
>>>>>>>>>>
>>>>>>>>>> I suggest we revise the instructions to keep only image swap
>>>>>>>>>> commands and remove the 'lite' update option. +Daniel Oliveira
>>>>>>>>>> <da...@google.com>, WDYT?  In the meantime, we should also
>>>>>>>>>> prepare an image that fixes the misconfiguration. Would you be able to help
>>>>>>>>>> with that? Thank you.
>>>>>>>>>>
>>>>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-13037
>>>>>>>>>> [2]
>>>>>>>>>> https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 19, 2021 at 8:46 AM Robert Burke <ro...@frantil.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> FYI it looks like all the Go tests are now failing because it
>>>>>>>>>>> can't find the Go command at all.
>>>>>>>>>>> Did a Jenkins image without Go (v1.16+) pre-installed get pushed?
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 18, 2021, 1:45 PM Valentyn Tymofieiev <
>>>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks Daniel,
>>>>>>>>>>>>
>>>>>>>>>>>> I can recreate the VMs on new disks.
>>>>>>>>>>>>
>>>>>>>>>>>> We currently have a set of stopped jenkins workers (named:
>>>>>>>>>>>> apache-beam-jenkins-##) and running workers (named:
>>>>>>>>>>>> apache-ci-beam-jenkins-##)
>>>>>>>>>>>>
>>>>>>>>>>>> Are there any concerns about deleting the stopped group of
>>>>>>>>>>>> workers?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 18, 2021 at 11:19 AM Ahmet Altay <al...@google.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you Daniel, Valentyn!
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Oct 18, 2021 at 8:02 AM Daniel Oliveira <
>>>>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I performed a light update of both Go and Python (from
>>>>>>>>>>>>>> Valentyn's update) on each worker VM over the weekend. I also added
>>>>>>>>>>>>>> additional instructions for the light update to Confluence (as an
>>>>>>>>>>>>>> alternative to the current instructions).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There is still reason to perform a full update at some point:
>>>>>>>>>>>>>> Valentyn updated the VM image from 500 GB to 1000 GB of storage, which
>>>>>>>>>>>>>> requires a full update to actually take effect.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Oct 12, 2021 at 10:32 AM Valentyn Tymofieiev <
>>>>>>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> > 3. SSH into the agent and perform the update.
>>>>>>>>>>>>>>> So, this would be a 'lite' version of the update, where we
>>>>>>>>>>>>>>> make changes to the live worker without recreating worker VM with a new
>>>>>>>>>>>>>>> image? We could perhaps document both options, and also make it clear that
>>>>>>>>>>>>>>> producing a VM image that has necessary updates is mandatory even if we
>>>>>>>>>>>>>>> perform 'lite' updates without recreating the worker.
>>>>>>>>>>>>>>> Also, for a lite update, marking the Jenkins worker offline
>>>>>>>>>>>>>>> may be optional, as some updates might not be disruptive (such as
>>>>>>>>>>>>>>> installing some software that will not be used immediately).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Oct 11, 2021 at 7:53 PM Robert Burke <
>>>>>>>>>>>>>>> robert@frantil.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> SGTM. Thank you very much Daniel!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Oct 11, 2021, 7:51 PM Ahmet Altay <al...@google.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thank you Daniel. Could you please update the wiki once
>>>>>>>>>>>>>>>>> you are done with the process?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Oct 11, 2021 at 6:22 PM Daniel Oliveira <
>>>>>>>>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Took me a bit to get to this, sorry. I finally figured
>>>>>>>>>>>>>>>>>> out an approach for updating Go and did so and will be updating the image
>>>>>>>>>>>>>>>>>> momentarily.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I think a more important note is that I tried what
>>>>>>>>>>>>>>>>>> Valentyn was considering, which is SSHing into workers and updating the
>>>>>>>>>>>>>>>>>> dependency. I'll describe the process below, but the summary is that I did
>>>>>>>>>>>>>>>>>> it on one worker with Go so far, saw no problems over the weekend, and
>>>>>>>>>>>>>>>>>> would like to continue updating the rest of the workers if there are no
>>>>>>>>>>>>>>>>>> objections.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Here's a step-by-step of what I did. If we decide to
>>>>>>>>>>>>>>>>>> stick with this approach, these instructions can be added to Confluence:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 1. Go to the page for the Jenkins agent you want to
>>>>>>>>>>>>>>>>>> update [1] and click "Mark this node temporarily offline", leaving a reason
>>>>>>>>>>>>>>>>>> such as "Updating X dependency."
>>>>>>>>>>>>>>>>>> 2. Wait until there are no more tests running in that
>>>>>>>>>>>>>>>>>> agent (under "Build Executor Status" on the left of the page).
>>>>>>>>>>>>>>>>>> 3. SSH into the agent and perform the update.
>>>>>>>>>>>>>>>>>> 4. Mark the node as online again.
>>>>>>>>>>>>>>>>>> 5. Repeat for every worker.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> And these are some additional steps if you want to
>>>>>>>>>>>>>>>>>> immediately run a test suite to check that the update worked correctly. For
>>>>>>>>>>>>>>>>>> example in my case, I wanted to check against the Go Postcommit, and it was
>>>>>>>>>>>>>>>>>> a good thing I did, because it actually failed the first time and I had to
>>>>>>>>>>>>>>>>>> go back in to fix a small oversight I made. So doing this after you update
>>>>>>>>>>>>>>>>>> your first worker is probably a good idea before updating the rest:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 1. Go to the page for the job you want to run (for
>>>>>>>>>>>>>>>>>> example: [2]).
>>>>>>>>>>>>>>>>>> 2. Click "Configure" on the left menu.
>>>>>>>>>>>>>>>>>> 3. Find the checkmark "Restrict where this project can be
>>>>>>>>>>>>>>>>>> run" and change the restriction from "beam" to the specific name of the
>>>>>>>>>>>>>>>>>> agent (ex. "apache-beam-jenkins-1").
>>>>>>>>>>>>>>>>>> 4. Save and apply that change.
>>>>>>>>>>>>>>>>>> 5. Back on the page for the job, click "Build with
>>>>>>>>>>>>>>>>>> Parameters" on the left menu.
>>>>>>>>>>>>>>>>>> 6. Run the build on "master".
>>>>>>>>>>>>>>>>>> 7. Once you're done checking the results, change
>>>>>>>>>>>>>>>>>> the restriction for the job back to "beam". (This also gets reset once
>>>>>>>>>>>>>>>>>> every 24 hours in case you forget.)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I did that on one agent (apache-beam-jenkins-2) on Friday
>>>>>>>>>>>>>>>>>> evening when it wasn't too busy, and got Go updated and working. I checked
>>>>>>>>>>>>>>>>>> that agent's execution history again today just in case, and it was healthy
>>>>>>>>>>>>>>>>>> over the weekend, with no Go-related problems as far as I could see. If
>>>>>>>>>>>>>>>>>> there are no objections, I'd like to go ahead and continue updating the rest
>>>>>>>>>>>>>>>>>> of the workers (I'll do this late at night or over the weekend to avoid
>>>>>>>>>>>>>>>>>> disrupting dev work).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>> https://ci-beam.apache.org/computer/apache-beam-jenkins-1/
>>>>>>>>>>>>>>>>>> [2] https://ci-beam.apache.org/job/beam_PostCommit_Go/
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, Oct 4, 2021 at 6:14 PM Valentyn Tymofieiev <
>>>>>>>>>>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I updated the image in [1], but did not change the
>>>>>>>>>>>>>>>>>>> workers to pick up the new image yet. We can do this once we add Go
>>>>>>>>>>>>>>>>>>> changes on top of it.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I am also considering SSHing into every worker and running a
>>>>>>>>>>>>>>>>>>> one-line command that adds the dependency that was missing. It seems to be
>>>>>>>>>>>>>>>>>>> low risk, and  there is a fall-back plan to re-start the worker using the
>>>>>>>>>>>>>>>>>>> saved image - both new and old images are saved and available in Cloud
>>>>>>>>>>>>>>>>>>> Console.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Ideally, we should find a way to do a rolling upgrade
>>>>>>>>>>>>>>>>>>> that a PMC or committer could trigger without logging into every machine.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424228#comment-17424228
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 3:28 PM Daniel Oliveira <
>>>>>>>>>>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> @Brian Hulette <bh...@google.com> That button seems
>>>>>>>>>>>>>>>>>>>> like exactly what we'd need. Doing it manually would be a pain, but it's
>>>>>>>>>>>>>>>>>>>> probably still preferable to causing a bunch of aborted tests.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> @Valentyn Tymofieiev <va...@google.com> Collaborating
>>>>>>>>>>>>>>>>>>>> to do both updates at once is a great idea! I'll message you directly about
>>>>>>>>>>>>>>>>>>>> it.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:44 PM Valentyn Tymofieiev <
>>>>>>>>>>>>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I am also interested in updating the version of
>>>>>>>>>>>>>>>>>>>>> Python on the VMs; I need to install Python 3.9. Thanks for looking into this.
>>>>>>>>>>>>>>>>>>>>> We can coordinate to make one update instead of two.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:40 PM Brian Hulette <
>>>>>>>>>>>>>>>>>>>>> bhulette@google.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I'm not sure about best practices here. Out of
>>>>>>>>>>>>>>>>>>>>>> curiosity I just poked around in the Jenkins UI (e.g. [1]) and it looks
>>>>>>>>>>>>>>>>>>>>>> like you can manually "Mark node temporarily offline" when logged in (if
>>>>>>>>>>>>>>>>>>>>>> you're a committer). According to [2] this will prevent it from picking up
>>>>>>>>>>>>>>>>>>>>>> new jobs after it's finished the currently executing ones. Doing that
>>>>>>>>>>>>>>>>>>>>>> manually for every worker could be a pain though.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>>> https://ci-beam.apache.org/computer/apache-beam-jenkins-13/
>>>>>>>>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>>>>>>>>> https://stackoverflow.com/questions/26553612/how-do-i-disable-a-node-in-jenkins-ui-after-it-has-completed-its-currently-runni
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 1:03 PM Daniel Oliveira <
>>>>>>>>>>>>>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I'm aiming at upgrading the version of Go on our
>>>>>>>>>>>>>>>>>>>>>>> Jenkins VMs, and I found these instructions on
>>>>>>>>>>>>>>>>>>>>>>> upgrading software on Jenkins
>>>>>>>>>>>>>>>>>>>>>>> <https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers> on
>>>>>>>>>>>>>>>>>>>>>>> our cwiki.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I haven't started going through it yet, but I was
>>>>>>>>>>>>>>>>>>>>>>> wondering about the last few steps that involve stopping VMs, deleting boot
>>>>>>>>>>>>>>>>>>>>>>> disks, and restarting executors. Is there some best practice for
>>>>>>>>>>>>>>>>>>>>>>> that section to avoid causing interruptions in our automated testing?
>>>>>>>>>>>>>>>>>>>>>>> Should I be trying to do this outside of peak dev hours, or going one VM at
>>>>>>>>>>>>>>>>>>>>>>> a time so others can pick up extra load, or anything like that?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>> Daniel Oliveira
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>

-- 

Daniela Martín (She/Her) | <https://www.wizeline.com/>

Site Reliability Engineer

daniela.martin@wizeline.com

Amado Nervo 2200, Esfera P6, Col. Ciudad del Sol, 45050 Zapopan, Jal.


Re: Best practices for upgrading installed dependencies on Jenkins VMs?

Posted by Valentyn Tymofieiev <va...@google.com>.
Heads-up, I am planning a Jenkins image upgrade with a minor change to
clean up some unwanted log4j artifacts from gradle caches to silence some
alerts I received. Hopefully no one else is currently doing an upgrade;
otherwise, please reach out.
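
For reference, the per-worker procedure quoted further down in this thread
(mark the node offline, wait for running jobs to finish, update, bring it
back online, repeat) can be sketched as a loop. This is only a rough
illustration in Python: every callable here is a hypothetical hook that the
caller would have to supply, not a real Jenkins or Cloud API.

```python
import time
from typing import Callable, Iterable, List


def rolling_update(
    workers: Iterable[str],
    mark_offline: Callable[[str], None],
    is_idle: Callable[[str], bool],
    update: Callable[[str], None],
    mark_online: Callable[[str], None],
    wait: Callable[[], None] = lambda: time.sleep(30),
) -> List[str]:
    """Update workers one at a time so the others keep serving jobs.

    All callables are hypothetical placeholders (assumptions for this
    sketch); they stand in for whatever UI clicks or scripts are used.
    """
    updated = []
    for name in workers:
        mark_offline(name)        # node stops accepting new jobs
        while not is_idle(name):  # let in-flight builds finish
            wait()
        update(name)              # if this raises, the node stays offline
        mark_online(name)         # return the node to the pool
        updated.append(name)
    return updated
```

Because workers are handled one at a time, the remaining ones keep picking
up jobs. If an update fails, the exception stops the loop and leaves that
worker offline instead of returning a possibly broken machine to the pool.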

I will take care of the PATH issue discussed here (if it is still an
issue).


On Tue, Nov 2, 2021 at 4:16 PM Robert Burke <ro...@frantil.com> wrote:

> TIL as well. Sounds like the right location. Thanks Valentyn!
>
> On Tue, Nov 2, 2021, 11:00 AM Valentyn Tymofieiev <va...@google.com>
> wrote:
>
>> Yeah,  .profile is only sourced by login shells. Adding the PATH in
>> .bashrc can be a workaround, but since .bashrc is executed every time a new
>> shell runs, PATH variable will be growing with every shell subprocess, so
>> several sources recommend .profile instead, which does not always work.
>> We should be able to fix this by updating  /etc/environment instead (TIL).
>>
>> This is the current content:
>> cat /etc/environment
>>
>> PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games"
>>
>>
>>
>>
>> On Mon, Nov 1, 2021 at 10:50 AM Robert Burke <ro...@frantil.com> wrote:
>>
>>> Looks like while .profile was edited to add in a PATH section pointing
>>> to /snap/bin (where go is now installed), it doesn't seem like .profile is
>>> executed by the jenkins login shells.
>>>
>>>
>>>
>>> On Fri, Oct 29, 2021, 6:23 PM Valentyn Tymofieiev <va...@google.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Wed, Oct 20, 2021 at 11:16 AM Valentyn Tymofieiev <
>>>> valentyn@google.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Wed, Oct 20, 2021 at 11:12 AM Pablo Estrada <pa...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks everyone for investigating and documenting this. I'll use it
>>>>>> today : )
>>>>>>
>>>>> Dan may be also in the middle of doing this, please coordinate.
>>>>>
>>>>>>
>>>>>> ahem - maybe we should rename the image name/image family names
>>>>>> to jenkins-worker-boot-image ? Does anyone foresee issues if we do that?
>>>>>> Does jenkins depend on these names in some undocumented way?
>>>>>>
>>>>> +1. it should 'just work', need to update the wiki after the change.
>>>>> Jenkins also did a terminology adjustment.
>>>>>
>>>> I had to reimage Jenkins workers again, took care of the rename and
>>>> changed the instructions.
>>>>
>>>> I am not sure what the status of the Go Postcommit problem is, but I noticed
>>>> that jenkins worker #1 had a different boot disk. I reimaged all workers
>>>> building on top of the latest image from the image family. If Go tests
>>>> start failing, we may need to get help from Dan again.
>>>>
>>>>
>>>>>
>>>>>> On Tue, Oct 19, 2021 at 1:43 PM Daniel Oliveira <
>>>>>> danoliveira@google.com> wrote:
>>>>>>
>>>>>>> I'm ok with deciding to avoid the "lite" update option; feel free to
>>>>>>> revise the instructions as seems appropriate. As for the issue, I fixed
>>>>>>> it with a workaround that should work until we need to add a new image to
>>>>>>> the agents, and I'm currently investigating the root cause and preparing a
>>>>>>> fixed image.
>>>>>>>
>>>>>>> That said, I think this issue would have still happened even if we
>>>>>>> didn't perform the "lite" update. I'm still trying to figure out the exact
>>>>>>> problem, but it looks to be a PATH issue that wasn't effectively caught by
>>>>>>> the current process. I won't get into details too much in this thread (see
>>>>>>> the Jira for that), but essentially everything works in my environment when
>>>>>>> I SSH into the VMs, but because the location of the "go" command changed in
>>>>>>> the PATH, it seems to have stopped working for every other user, including
>>>>>>> the Jenkins agents. I actually did notice that would happen when I was
>>>>>>> working on the image, but the solution seemed to be to reboot the machine,
>>>>>>> which I assumed happened already since I shut down the VM to image it.
>>>>>>>
>>>>>>> On Tue, Oct 19, 2021 at 12:09 PM Robert Burke <ro...@frantil.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> +1 to only having one way to do things. The Lite option seems
>>>>>>>> liable to cause more problems since it means its changes can be blown away
>>>>>>>> if a new image isn't prepared anyway.
>>>>>>>> I don't think we are changing the images often enough for it.
>>>>>>>> Perhaps call it the option to test changes if anything?
>>>>>>>>
>>>>>>>> On Tue, Oct 19, 2021, 11:55 AM Valentyn Tymofieiev <
>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>
>>>>>>>>> All workers were updated to use jenkins-slave-boot-image-20211011,
>>>>>>>>> which should have had a go command, but it appears slightly misconfigured.
>>>>>>>>> I reopened BEAM-13037 [1] and added some details there.
>>>>>>>>>
>>>>>>>>> I also added instructions to wiki [2] on how to perform an image
>>>>>>>>> swap and it is actually very straightforward. I think a lesson here is that
>>>>>>>>> making 'lite' upgrades is brittle as misconfigurations could resurface down
>>>>>>>>> the road when the context of the lite upgrade is no longer fresh in our
>>>>>>>>> memory.
>>>>>>>>>
>>>>>>>>> I suggest we revise the instructions to keep only image swap
>>>>>>>>> commands and remove the 'lite' update option. +Daniel Oliveira
>>>>>>>>> <da...@google.com>, WDYT?  In the meantime, we should also
>>>>>>>>> prepare an image that fixes the misconfiguration. Would you be able to help
>>>>>>>>> with that? Thank you.
>>>>>>>>>
>>>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-13037
>>>>>>>>> [2]
>>>>>>>>> https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Oct 19, 2021 at 8:46 AM Robert Burke <ro...@frantil.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> FYI it looks like all the Go tests are now failing because it
>>>>>>>>>> can't find the Go command at all.
>>>>>>>>>> Did a Jenkins image without Go (v1.16+) pre-installed get pushed?
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 18, 2021, 1:45 PM Valentyn Tymofieiev <
>>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Daniel,
>>>>>>>>>>>
>>>>>>>>>>> I can recreate the VMs on new disks.
>>>>>>>>>>>
>>>>>>>>>>> We currently have a set of stopped jenkins workers (named:
>>>>>>>>>>> apache-beam-jenkins-##) and running workers (named:
>>>>>>>>>>> apache-ci-beam-jenkins-##)
>>>>>>>>>>>
>>>>>>>>>>> Are there any concerns about deleting the stopped group of
>>>>>>>>>>> workers?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 18, 2021 at 11:19 AM Ahmet Altay <al...@google.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thank you Daniel, Valentyn!
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 18, 2021 at 8:02 AM Daniel Oliveira <
>>>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I performed a light update of both Go and Python (from
>>>>>>>>>>>>> Valentyn's update) on each worker VM over the weekend. I also added
>>>>>>>>>>>>> additional instructions for the light update to Confluence (as an
>>>>>>>>>>>>> alternative to the current instructions).
>>>>>>>>>>>>>
>>>>>>>>>>>>> There is still reason to perform a full update at some point:
>>>>>>>>>>>>> Valentyn updated the VM image from 500 GB to 1000 GB of storage, which
>>>>>>>>>>>>> requires a full update to actually take effect.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Oct 12, 2021 at 10:32 AM Valentyn Tymofieiev <
>>>>>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> > 3. SSH into the agent and perform the update.
>>>>>>>>>>>>>> So, this would be a 'lite' version of the update, where we
>>>>>>>>>>>>>> make changes to the live worker without recreating worker VM with a new
>>>>>>>>>>>>>> image? We could perhaps document both options, and also make it clear that
>>>>>>>>>>>>>> producing a VM image that has necessary updates is mandatory even if we
>>>>>>>>>>>>>> perform 'lite' updates without recreating the worker.
>>>>>>>>>>>>>> Also, for a lite update, marking the Jenkins worker offline
>>>>>>>>>>>>>> may be optional, as some updates might not be disruptive (such as
>>>>>>>>>>>>>> installing some software that will not be used immediately).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Oct 11, 2021 at 7:53 PM Robert Burke <
>>>>>>>>>>>>>> robert@frantil.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> SGTM. Thank you very much Daniel!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Oct 11, 2021, 7:51 PM Ahmet Altay <al...@google.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thank you Daniel. Could you please update the wiki once you
>>>>>>>>>>>>>>>> are done with the process?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Oct 11, 2021 at 6:22 PM Daniel Oliveira <
>>>>>>>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Took me a bit to get to this, sorry. I finally figured out
>>>>>>>>>>>>>>>>> an approach for updating Go and did so and will be updating the image
>>>>>>>>>>>>>>>>> momentarily.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think a more important note is that I tried what
>>>>>>>>>>>>>>>>> Valentyn was considering, which is SSHing into workers and updating the
>>>>>>>>>>>>>>>>> dependency. I'll describe the process below, but the summary is that I did
>>>>>>>>>>>>>>>>> it on one worker with Go so far, saw no problems over the weekend, and
>>>>>>>>>>>>>>>>> would like to continue updating the rest of the workers if there are no
>>>>>>>>>>>>>>>>> objections.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Here's a step-by-step of what I did. If we decide to stick
>>>>>>>>>>>>>>>>> with this approach, these instructions can be added to Confluence:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1. Go to the page for the Jenkins agent you want to update
>>>>>>>>>>>>>>>>> [1] and click "Mark this node temporarily offline", leaving a reason such
>>>>>>>>>>>>>>>>> as "Updating X dependency."
>>>>>>>>>>>>>>>>> 2. Wait until there are no more tests running in that
>>>>>>>>>>>>>>>>> agent (under "Build Executor Status" on the left of the page).
>>>>>>>>>>>>>>>>> 3. SSH into the agent and perform the update.
>>>>>>>>>>>>>>>>> 4. Mark the node as online again.
>>>>>>>>>>>>>>>>> 5. Repeat for every worker.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> And these are some additional steps if you want to
>>>>>>>>>>>>>>>>> immediately run a test suite to check that the update worked correctly. For
>>>>>>>>>>>>>>>>> example in my case, I wanted to check against the Go Postcommit, and it was
>>>>>>>>>>>>>>>>> a good thing I did, because it actually failed the first time and I had to
>>>>>>>>>>>>>>>>> go back in to fix a small oversight I made. So doing this after you update
>>>>>>>>>>>>>>>>> your first worker is probably a good idea before updating the rest:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1. Go to the page for the job you want to run (for
>>>>>>>>>>>>>>>>> example: [2]).
>>>>>>>>>>>>>>>>> 2. Click "Configure" on the left menu.
>>>>>>>>>>>>>>>>> 3. Find the checkmark "Restrict where this project can be
>>>>>>>>>>>>>>>>> run" and change the restriction from "beam" to the specific name of the
>>>>>>>>>>>>>>>>> agent (ex. "apache-beam-jenkins-1").
>>>>>>>>>>>>>>>>> 4. Save and apply that change.
>>>>>>>>>>>>>>>>> 5. Back on the page for the job, click "Build with
>>>>>>>>>>>>>>>>> Parameters" on the left menu.
>>>>>>>>>>>>>>>>> 6. Run the build on "master".
>>>>>>>>>>>>>>>>> 7. Once you're done checking the results, change
>>>>>>>>>>>>>>>>> the restriction for the job back to "beam". (This also gets reset once
>>>>>>>>>>>>>>>>> every 24 hours in case you forget.)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I did that on one agent (apache-beam-jenkins-2) on Friday
>>>>>>>>>>>>>>>>> evening when it wasn't too busy, and got Go updated and working. I checked
>>>>>>>>>>>>>>>>> that agent's execution history again today just in case, and it was healthy
>>>>>>>>>>>>>>>>> over the weekend, with no Go-related problems as far as I could see. If
>>>>>>>>>>>>>>>>> there are no objections, I'd like to go ahead and continue updating the rest
>>>>>>>>>>>>>>>>> of the workers (I'll do this late at night or over the weekend to avoid
>>>>>>>>>>>>>>>>> disrupting dev work).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>> https://ci-beam.apache.org/computer/apache-beam-jenkins-1/
>>>>>>>>>>>>>>>>> [2] https://ci-beam.apache.org/job/beam_PostCommit_Go/
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Oct 4, 2021 at 6:14 PM Valentyn Tymofieiev <
>>>>>>>>>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I updated the image in [1], but did not change the
>>>>>>>>>>>>>>>>>> workers to pick up the new image yet. We can do this once we add Go
>>>>>>>>>>>>>>>>>> changes on top of it.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I am also considering SSHing into every worker and running a
>>>>>>>>>>>>>>>>>> one-line command that adds the dependency that was missing. It seems to be
>>>>>>>>>>>>>>>>>> low risk, and  there is a fall-back plan to re-start the worker using the
>>>>>>>>>>>>>>>>>> saved image - both new and old images are saved and available in Cloud
>>>>>>>>>>>>>>>>>> Console.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Ideally, we should find a way to do a rolling upgrade
>>>>>>>>>>>>>>>>>> that a PMC or committer could trigger without logging into every machine.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424228#comment-17424228
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 3:28 PM Daniel Oliveira <
>>>>>>>>>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> @Brian Hulette <bh...@google.com> That button seems
>>>>>>>>>>>>>>>>>>> like exactly what we'd need. Doing it manually would be a pain, but it's
>>>>>>>>>>>>>>>>>>> probably still preferable to causing a bunch of aborted tests.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> @Valentyn Tymofieiev <va...@google.com> Collaborating
>>>>>>>>>>>>>>>>>>> to do both updates at once is a great idea! I'll message you directly about
>>>>>>>>>>>>>>>>>>> it.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:44 PM Valentyn Tymofieiev <
>>>>>>>>>>>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I am also interested in updating the version of Python
>>>>>>>>>>>>>>>>>>>> on the VMs; I need to install Python 3.9. Thanks for looking into this. We can
>>>>>>>>>>>>>>>>>>>> coordinate to make one update instead of two.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:40 PM Brian Hulette <
>>>>>>>>>>>>>>>>>>>> bhulette@google.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I'm not sure about best practices here. Out of
>>>>>>>>>>>>>>>>>>>>> curiosity I just poked around in the Jenkins UI (e.g. [1]) and it looks
>>>>>>>>>>>>>>>>>>>>> like you can manually "Mark node temporarily offline" when logged in (if
>>>>>>>>>>>>>>>>>>>>> you're a committer). According to [2] this will prevent it from picking up
>>>>>>>>>>>>>>>>>>>>> new jobs after it's finished the currently executing ones. Doing that
>>>>>>>>>>>>>>>>>>>>> manually for every worker could be a pain though.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>> https://ci-beam.apache.org/computer/apache-beam-jenkins-13/
>>>>>>>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>>>>>>>> https://stackoverflow.com/questions/26553612/how-do-i-disable-a-node-in-jenkins-ui-after-it-has-completed-its-currently-runni
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 1:03 PM Daniel Oliveira <
>>>>>>>>>>>>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I'm aiming at upgrading the version of Go on our
>>>>>>>>>>>>>>>>>>>>>> Jenkins VMs, and I found these instructions on
>>>>>>>>>>>>>>>>>>>>>> upgrading software on Jenkins
>>>>>>>>>>>>>>>>>>>>>> <https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers> on
>>>>>>>>>>>>>>>>>>>>>> our cwiki.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I haven't started going through it yet, but I was
>>>>>>>>>>>>>>>>>>>>>> wondering about the last few steps that involve stopping VMs, deleting boot
>>>>>>>>>>>>>>>>>>>>>> disks, and restarting executors. Is there some best practice for
>>>>>>>>>>>>>>>>>>>>>> that section to avoid causing interruptions in our automated testing?
>>>>>>>>>>>>>>>>>>>>>> Should I be trying to do this outside of peak dev hours, or going one VM at
>>>>>>>>>>>>>>>>>>>>>> a time so others can pick up extra load, or anything like that?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>> Daniel Oliveira
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>

Re: Best practices for upgrading installed dependencies on Jenkins VMs?

Posted by Robert Burke <ro...@frantil.com>.
TIL as well. Sounds like the right location. Thanks Valentyn!
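
For anyone following along, the /etc/environment change could look roughly
like the sketch below. It is a minimal illustration in Python, assuming Go
lives in /snap/bin (as noted in the quoted message) and that the file has a
single PATH="..." line like the one Valentyn pasted. The function name is
made up for this example, and it only transforms text; it does not write to
the real file.

```python
import re


def add_to_environment_path(content: str, new_dir: str) -> str:
    """Return /etc/environment-style text with new_dir appended to PATH.

    Idempotent: if new_dir is already listed, the text is returned
    unchanged, so PATH does not grow on repeated runs.
    """
    pattern = re.compile(r'^PATH="?([^"\n]*)"?$', re.MULTILINE)
    match = pattern.search(content)
    if match is None:
        # No PATH line yet: append one.
        sep = "" if (not content or content.endswith("\n")) else "\n"
        return f'{content}{sep}PATH="{new_dir}"\n'
    dirs = [d for d in match.group(1).split(":") if d]
    if new_dir in dirs:
        return content
    dirs.append(new_dir)
    new_line = 'PATH="' + ":".join(dirs) + '"'
    return content[:match.start()] + new_line + content[match.end():]


# The PATH line quoted in this thread.
current = (
    'PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:'
    '/sbin:/bin:/usr/games:/usr/local/games"\n'
)
updated = add_to_environment_path(current, "/snap/bin")
print(updated, end="")
```

Checking for the directory before appending avoids the growing-PATH problem
described for .bashrc. Changes to /etc/environment generally only affect
sessions started afterwards, so the Jenkins agent would likely need to be
restarted (or the VM rebooted) to pick them up.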

On Tue, Nov 2, 2021, 11:00 AM Valentyn Tymofieiev <va...@google.com>
wrote:

> Yeah,  .profile is only sourced by login shells. Adding the PATH in
> .bashrc can be a workaround, but since .bashrc is executed every time a new
> shell runs, PATH variable will be growing with every shell subprocess, so
> several sources recommend .profile instead, which does not always work.
> We should be able to fix this by updating  /etc/environment instead (TIL).
>
> This is the current content:
> cat /etc/environment
>
> PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games"
>
>
>
>
> On Mon, Nov 1, 2021 at 10:50 AM Robert Burke <ro...@frantil.com> wrote:
>
>> Looks like while .profile was edited to add in a PATH section pointing to
>> /snap/bin (where go is now installed), it doesn't seem like .profile is
>> executed by the jenkins login shells.
>>
>>
>>
>> On Fri, Oct 29, 2021, 6:23 PM Valentyn Tymofieiev <va...@google.com>
>> wrote:
>>
>>>
>>>
>>> On Wed, Oct 20, 2021 at 11:16 AM Valentyn Tymofieiev <
>>> valentyn@google.com> wrote:
>>>
>>>>
>>>>
>>>> On Wed, Oct 20, 2021 at 11:12 AM Pablo Estrada <pa...@google.com>
>>>> wrote:
>>>>
>>>>> Thanks everyone for investigating and documenting this. I'll use it
>>>>> today : )
>>>>>
>>>> Dan may be also in the middle of doing this, please coordinate.
>>>>
>>>>>
>>>>> ahem - maybe we should rename the image name/image family names
>>>>> to jenkins-worker-boot-image ? Does anyone foresee issues if we do that?
>>>>> Does jenkins depend on these names in some undocumented way?
>>>>>
>>>> +1. it should 'just work', need to update the wiki after the change.
>>>> Jenkins also did a terminology adjustment.
>>>>
>>> I had to reimage Jenkins workers again, took care of the rename and
>>> changed the instructions.
>>>
>>> I am not sure what is the status of Go Postcommit problem, but noticed
>>> that jenkins worker #1 had a different boot disk. I reimaged all workers
>>> building on top of the latest image from the image family. If Go tests
>>> start failing, we may need to get help from Dan again.
>>>
>>>
>>>>
>>>>> On Tue, Oct 19, 2021 at 1:43 PM Daniel Oliveira <
>>>>> danoliveira@google.com> wrote:
>>>>>
>>>>>> I'm ok with deciding to avoid the "lite" update option, feel free to
>>>>>> revise the instructions as it seems appropriate. As for the issue, I fixed
>>>>>> it with a workaround that should work until we need to add a new image to
>>>>>> the agents, and I'm currently investigating the root cause and preparing a
>>>>>> fixed image.
>>>>>>
>>>>>> That said, I think this issue would have still happened even if we
>>>>>> didn't perform the "lite" update. I'm still trying to figure out the exact
>>>>>> problem, but it looks to be a PATH issue that wasn't effectively caught by
>>>>>> the current process. I won't get into details too much in this thread (see
>>>>>> the Jira for that), but essentially everything works in my environment when
>>>>>> I SSH into the VMs, but because the location of the "go" command changed in
>>>>>> the PATH, it seems to have stopped working for every other user, including
>>>>>> the Jenkins agents. I actually did notice that would happen when I was
>>>>>> working on the image, but the solution seemed to be to reboot the machine,
>>>>>> which I assumed happened already since I shut down the VM to image it.
>>>>>>
>>>>>> On Tue, Oct 19, 2021 at 12:09 PM Robert Burke <ro...@frantil.com>
>>>>>> wrote:
>>>>>>
>>>>>>> +1 to only having one way to do things. The Lite option seems liable
>>>>>>> to cause more problems since it means its changes can be blown away if a
>>>>>>> new image isn't prepared anyway.
>>>>>>> I don't think we are changing the images often enough for it.
>>>>>>> Perhaps call it the option to test changes if anything?
>>>>>>>
>>>>>>> On Tue, Oct 19, 2021, 11:55 AM Valentyn Tymofieiev <
>>>>>>> valentyn@google.com> wrote:
>>>>>>>
>>>>>>>> All workers were updated to use jenkins-slave-boot-image-20211011,
>>>>>>>> which should have had a go command, but it appears slightly misconfigured.
>>>>>>>> I reopened BEAM-13037 [1] and added some details there.
>>>>>>>>
>>>>>>>> I also added instructions to wiki [2] on how to perform an image
>>>>>>>> swap and it is actually very straightforward. I think a lesson here is that
>>>>>>>> making 'lite' upgrades is brittle as misconfigurations could resurface down
>>>>>>>> the road when the context of the lite upgrade is no longer fresh in our
>>>>>>>> memory.
>>>>>>>>
>>>>>>>> I suggest we revise the instructions to keep only image swap
>>>>>>>> commands and remove the 'lite' update option. +Daniel Oliveira
>>>>>>>> <da...@google.com>, WDYT?  In the meantime, we should also
>>>>>>>> prepare an image that fixes the misconfiguration. Would you be able to help
>>>>>>>> with that? Thank you.
>>>>>>>>
>>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-13037
>>>>>>>> [2]
>>>>>>>> https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Oct 19, 2021 at 8:46 AM Robert Burke <ro...@frantil.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> FYI it looks like all the Go tests are now failing because it
>>>>>>>>> can't find the Go command at all.
>>>>>>>>> Did a Jenkins image without Go (v1.16+) pre-installed get pushed?
>>>>>>>>>
>>>>>>>>> On Mon, Oct 18, 2021, 1:45 PM Valentyn Tymofieiev <
>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Daniel,
>>>>>>>>>>
>>>>>>>>>> I can recreate the VMs on new disks.
>>>>>>>>>>
>>>>>>>>>> We currently have a set of stopped jenkins workers (named:
>>>>>>>>>> apache-beam-jenkins-##) and running workers (named:
>>>>>>>>>> apache-ci-beam-jenkins-##)
>>>>>>>>>>
>>>>>>>>>> Are there any concerns about deleting the stopped group of
>>>>>>>>>> workers?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 18, 2021 at 11:19 AM Ahmet Altay <al...@google.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thank you Daniel, Valentyn!
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 18, 2021 at 8:02 AM Daniel Oliveira <
>>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I performed a light update of both Go and Python (from
>>>>>>>>>>>> Valentyn's update) on each worker VM over the weekend. I also added
>>>>>>>>>>>> additional instructions for the light update to Confluence (as an
>>>>>>>>>>>> alternative to the current instructions).
>>>>>>>>>>>>
>>>>>>>>>>>> There is still reason to perform a full update at some point:
>>>>>>>>>>>> Valentyn updated the VM image from 500 GB to 1000 GB of storage, which
>>>>>>>>>>>> requires a full update to actually take effect.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Oct 12, 2021 at 10:32 AM Valentyn Tymofieiev <
>>>>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> > 3. SSH into the agent and perform the update.
>>>>>>>>>>>>> So, this would be a 'lite' version of the update, where we
>>>>>>>>>>>>> make changes to the live worker without recreating worker VM with a new
>>>>>>>>>>>>> image? We could perhaps document both options, and also make it clear that
>>>>>>>>>>>>> producing a VM image that has necessary updates is mandatory even if we
>>>>>>>>>>>>> perform 'lite' updates without recreating the worker.
>>>>>>>>>>>>> Also, for a lite update, marking the Jenkins offer offline may
>>>>>>>>>>>>> be optional, as some updates might not be disruptive (such as installing
>>>>>>>>>>>>> some software that will not be used immediately).
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Oct 11, 2021 at 7:53 PM Robert Burke <
>>>>>>>>>>>>> robert@frantil.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> SGTM. Thank you very much Daniel!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Oct 11, 2021, 7:51 PM Ahmet Altay <al...@google.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you Daniel. Could you please update the wiki once you
>>>>>>>>>>>>>>> are done with the process?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Oct 11, 2021 at 6:22 PM Daniel Oliveira <
>>>>>>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Took me a bit to get to this, sorry. I finally figured out
>>>>>>>>>>>>>>>> an approach for updating Go and did so and will be updating the image
>>>>>>>>>>>>>>>> momentarily.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think a more important note is that I tried what Valentyn
>>>>>>>>>>>>>>>> was considering, which is SSHing into workers and updating the dependency.
>>>>>>>>>>>>>>>> I'll describe the process below, but the summary is that I did it on one
>>>>>>>>>>>>>>>> worker with Go so far, saw no problems over the weekend, and would like to
>>>>>>>>>>>>>>>> continue updating the rest of the workers if there are no objections.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Here's a step-by-step of what I did. If we decide to stick
>>>>>>>>>>>>>>>> with this approach, these instructions can be added to Confluence:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1. Go to the page for the Jenkins agent you want to update
>>>>>>>>>>>>>>>> [1] and click "Mark this node temporarily offline", leaving a reason such
>>>>>>>>>>>>>>>> as "Updating X dependency."
>>>>>>>>>>>>>>>> 2. Wait until there are no more tests running in that agent
>>>>>>>>>>>>>>>> (under "Build Executor Status" on the left of the page).
>>>>>>>>>>>>>>>> 3. SSH into the agent and perform the update.
>>>>>>>>>>>>>>>> 4. Mark the node as online again.
>>>>>>>>>>>>>>>> 5. Repeat for every worker.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> And these are some additional steps if you want to
>>>>>>>>>>>>>>>> immediately run a test suite to check that the update worked correctly. For
>>>>>>>>>>>>>>>> example in my case, I wanted to check against the Go Postcommit, and it was
>>>>>>>>>>>>>>>> a good thing I did, because it actually failed the first time and I had to
>>>>>>>>>>>>>>>> go back in to fix a small oversight I made. So doing this after you update
>>>>>>>>>>>>>>>> your first worker is probably a good idea before updating the rest:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1. Go to the page for the job you want to run (for example:
>>>>>>>>>>>>>>>> [2]).
>>>>>>>>>>>>>>>> 2. Click "Configure" on the left menu.
>>>>>>>>>>>>>>>> 3. Find the checkmark "Restrict where this project can be
>>>>>>>>>>>>>>>> run" and change the restriction from "beam" to the specific name of the
>>>>>>>>>>>>>>>> agent (ex. "apache-beam-jenkins-1").
>>>>>>>>>>>>>>>> 4. Save and apply that change.
>>>>>>>>>>>>>>>> 5. Back on the page for the job, click "Build with
>>>>>>>>>>>>>>>> Parameters" on the left menu.
>>>>>>>>>>>>>>>> 6. Run the build on "master".
>>>>>>>>>>>>>>>> 7. Once you're done checking the results, change
>>>>>>>>>>>>>>>> the restriction for the job back to "beam". (This also gets reset once
>>>>>>>>>>>>>>>> every 24 hours in case you forget.)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I did that on one agent (apache-beam-jenkins-2) on Friday
>>>>>>>>>>>>>>>> evening when it wasn't too busy, and got Go updated and working. I checked
>>>>>>>>>>>>>>>> that agent's execution history again today just in case, and it was healthy
>>>>>>>>>>>>>>>> over the weekend, with no Go-related problems as far as I could see. If
>>>>>>>>>>>>>>>> there's no objections I'd like to go ahead and continue updating the rest
>>>>>>>>>>>>>>>> of the workers (I'll do this late at night or over the weekend to avoid
>>>>>>>>>>>>>>>> disrupting dev work).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>> https://ci-beam.apache.org/computer/apache-beam-jenkins-1/
>>>>>>>>>>>>>>>> [2] https://ci-beam.apache.org/job/beam_PostCommit_Go/
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Oct 4, 2021 at 6:14 PM Valentyn Tymofieiev <
>>>>>>>>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I updated the image in [1], but did not change the workers
>>>>>>>>>>>>>>>>> yet to pick up the new image. We can do this once we add Go changes on
>>>>>>>>>>>>>>>>> top of it.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am also considering to SSH into every worker and run a
>>>>>>>>>>>>>>>>> one-line command that adds the dependency that was missing. It seems to be
>>>>>>>>>>>>>>>>> low risk, and  there is a fall-back plan to re-start the worker using the
>>>>>>>>>>>>>>>>> saved image - both new and old images are saved and available in Cloud
>>>>>>>>>>>>>>>>> Console.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Ideally, we should find a way to do a rolling upgrade that
>>>>>>>>>>>>>>>>> a PMC or committer could trigger without logging into every machine.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424228#comment-17424228
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 3:28 PM Daniel Oliveira <
>>>>>>>>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> @Brian Hulette <bh...@google.com> That button seems
>>>>>>>>>>>>>>>>>> like exactly what we'd need. Doing it manually would be a pain, but it's
>>>>>>>>>>>>>>>>>> probably still preferable to causing a bunch of aborted tests.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> @Valentyn Tymofieiev <va...@google.com> Collaborating
>>>>>>>>>>>>>>>>>> to do both updates at once is a great idea! I'll message you directly about
>>>>>>>>>>>>>>>>>> it.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:44 PM Valentyn Tymofieiev <
>>>>>>>>>>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I am also interested in this updating version of Python
>>>>>>>>>>>>>>>>>>> on VMs, I need to install Python 3.9. Thanks for looking into this.  We can
>>>>>>>>>>>>>>>>>>> coordinate together to make one update instead of two.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:40 PM Brian Hulette <
>>>>>>>>>>>>>>>>>>> bhulette@google.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I'm not sure about best practices here. Out of
>>>>>>>>>>>>>>>>>>>> curiosity I just poked around in the Jenkins UI (e.g. [1]) and it looks
>>>>>>>>>>>>>>>>>>>> like you can manually "Mark node temporarily offline" when logged in (if
>>>>>>>>>>>>>>>>>>>> you're a committer). According to [2] this will prevent it from picking up
>>>>>>>>>>>>>>>>>>>> new jobs after it's finished the currently executing ones. Doing that
>>>>>>>>>>>>>>>>>>>> manually for every worker could be a pain though.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>> https://ci-beam.apache.org/computer/apache-beam-jenkins-13/
>>>>>>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>>>>>>> https://stackoverflow.com/questions/26553612/how-do-i-disable-a-node-in-jenkins-ui-after-it-has-completed-its-currently-runni
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 1:03 PM Daniel Oliveira <
>>>>>>>>>>>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I'm aiming at upgrading the version of Go on our
>>>>>>>>>>>>>>>>>>>>> Jenkins VMs, and I found these instructions on
>>>>>>>>>>>>>>>>>>>>> upgrading software on Jenkins
>>>>>>>>>>>>>>>>>>>>> <https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers> on
>>>>>>>>>>>>>>>>>>>>> our cwiki.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I haven't started going through it yet, but I was
>>>>>>>>>>>>>>>>>>>>> wondering about the last few steps that involve stopping VMs, deleting boot
>>>>>>>>>>>>>>>>>>>>> disks, and restarting executors. Is there some best practice for
>>>>>>>>>>>>>>>>>>>>> that section to avoid causing interruptions in our automated testing?
>>>>>>>>>>>>>>>>>>>>> Should I be trying to do this outside of peak dev hours, or going one VM at
>>>>>>>>>>>>>>>>>>>>> a time so others can pick up extra load, or anything like that?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> Daniel Oliveira
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>

Re: Best practices for upgrading installed dependencies on Jenkins VMs?

Posted by Valentyn Tymofieiev <va...@google.com>.
Yeah, .profile is only sourced by login shells. Adding the PATH entry in
.bashrc can be a workaround, but since .bashrc is executed every time a new
shell runs, the PATH variable will grow with every shell subprocess, so
several sources recommend .profile instead, which does not always work.
We should be able to fix this by updating /etc/environment instead (TIL).

This is the current content:
cat /etc/environment
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games"

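A minimal sketch of that fix, in case it helps whoever prepares the next
image. The helper name add_to_env_path is hypothetical, and the sketch
assumes /etc/environment holds a single double-quoted PATH="..." line like
the one above and that go lives under /snap/bin, as Robert noted. It appends
a directory to that PATH line only if it is not already present, so running
it twice does not grow PATH:

```shell
# Hypothetical helper: append a directory to the PATH="..." line of an
# environment file, but only if that directory is not already listed.
# Assumes the file contains one line of the form PATH="a:b:c".
add_to_env_path() {
  env_file="$1"
  new_dir="$2"

  # Extract the current value between the double quotes.
  current=$(sed -n 's/^PATH="\(.*\)"$/\1/p' "$env_file")

  # Already present as a whole PATH component? Then there is nothing to do.
  case ":$current:" in
    *":$new_dir:"*) return 0 ;;
  esac

  # Rewrite the PATH line via a temporary file, then copy the result back
  # so the original file keeps its owner and permissions.
  # If the file has no PATH="..." line, it is left unchanged.
  tmp=$(mktemp)
  sed "s|^PATH=\".*\"\$|PATH=\"$current:$new_dir\"|" "$env_file" > "$tmp" &&
    cat "$tmp" > "$env_file"
  rm -f "$tmp"
}
```

On a worker this would be run as root, for example
add_to_env_path /etc/environment /snap/bin. As far as I know
/etc/environment is read by pam_env when a session starts, so agents that
are already running would likely need a reconnect or reboot to see the
change.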



On Mon, Nov 1, 2021 at 10:50 AM Robert Burke <ro...@frantil.com> wrote:

> It looks like .profile was edited to add a PATH entry pointing to
> /snap/bin (where go is now installed), but .profile doesn't seem to be
> executed by the Jenkins login shells.
>
>
>
> On Fri, Oct 29, 2021, 6:23 PM Valentyn Tymofieiev <va...@google.com>
> wrote:
>
>>
>>
>> On Wed, Oct 20, 2021 at 11:16 AM Valentyn Tymofieiev <va...@google.com>
>> wrote:
>>
>>>
>>>
>>> On Wed, Oct 20, 2021 at 11:12 AM Pablo Estrada <pa...@google.com>
>>> wrote:
>>>
>>>> Thanks everyone for investigating and documenting this. I'll use it
>>>> today : )
>>>>
>>> Dan may be also in the middle of doing this, please coordinate.
>>>
>>>>
>>>> ahem - maybe we should rename the image name/image family names
>>>> to jenkins-worker-boot-image ? Does anyone foresee issues if we do that?
>>>> Does jenkins depend on these names in some undocumented way?
>>>>
>>> +1. it should 'just work', need to update the wiki after the change.
>>> Jenkins also did a terminology adjustment.
>>>
>> I had to reimage Jenkins workers again, took care of the rename and
>> changed the instructions.
>>
>> I am not sure what is the status of Go Postcommit problem, but noticed
>> that jenkins worker #1 had a different boot disk. I reimaged all workers
>> building on top of the latest image from the image family. If Go tests
>> start failing, we may need to get help from Dan again.
>>
>>
>>>
>>>> On Tue, Oct 19, 2021 at 1:43 PM Daniel Oliveira <da...@google.com>
>>>> wrote:
>>>>
>>>>> I'm ok with deciding to avoid the "lite" update option, feel free to
>>>>> revise the instructions as it seems appropriate. As for the issue, I fixed
>>>>> it with a workaround that should work until we need to add a new image to
>>>>> the agents, and I'm currently investigating the root cause and preparing a
>>>>> fixed image.
>>>>>
>>>>> That said, I think this issue would have still happened even if we
>>>>> didn't perform the "lite" update. I'm still trying to figure out the exact
>>>>> problem, but it looks to be a PATH issue that wasn't effectively caught by
>>>>> the current process. I won't get into details too much in this thread (see
>>>>> the Jira for that), but essentially everything works in my environment when
>>>>> I SSH into the VMs, but because the location of the "go" command changed in
>>>>> the PATH, it seems to have stopped working for every other user, including
>>>>> the Jenkins agents. I actually did notice that would happen when I was
>>>>> working on the image, but the solution seemed to be to reboot the machine,
>>>>> which I assumed happened already since I shut down the VM to image it.
>>>>>
>>>>> On Tue, Oct 19, 2021 at 12:09 PM Robert Burke <ro...@frantil.com>
>>>>> wrote:
>>>>>
>>>>>> +1 to only having one way to do things. The Lite option seems liable
>>>>>> to cause more problems since it means its changes can be blown away if a
>>>>>> new image isn't prepared anyway.
>>>>>> I don't think we are changing the images often enough for it.
>>>>>> Perhaps call it the option to test changes if anything?
>>>>>>
>>>>>> On Tue, Oct 19, 2021, 11:55 AM Valentyn Tymofieiev <
>>>>>> valentyn@google.com> wrote:
>>>>>>
>>>>>>> All workers were updated to use jenkins-slave-boot-image-20211011,
>>>>>>> which should have had a go command, but it appears slightly misconfigured.
>>>>>>> I reopened BEAM-13037 [1] and added some details there.
>>>>>>>
>>>>>>> I also added instructions to wiki [2] on how to perform an image
>>>>>>> swap and it is actually very straightforward. I think a lesson here is that
>>>>>>> making 'lite' upgrades is brittle as misconfigurations could resurface down
>>>>>>> the road when the context of the lite upgrade is no longer fresh in our
>>>>>>> memory.
>>>>>>>
>>>>>>> I suggest we revise the instructions to keep only image swap
>>>>>>> commands and remove the 'lite' update option. +Daniel Oliveira
>>>>>>> <da...@google.com>, WDYT?  In the meantime, we should also
>>>>>>> prepare an image that fixes the misconfiguration. Would you be able to help
>>>>>>> with that? Thank you.
>>>>>>>
>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-13037
>>>>>>> [2]
>>>>>>> https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Oct 19, 2021 at 8:46 AM Robert Burke <ro...@frantil.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> FYI it looks like all the Go tests are now failing because it can't
>>>>>>>> find the Go command at all.
>>>>>>>> Did a Jenkins image without Go (v1.16+) pre-installed get pushed?
>>>>>>>>
>>>>>>>> On Mon, Oct 18, 2021, 1:45 PM Valentyn Tymofieiev <
>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks Daniel,
>>>>>>>>>
>>>>>>>>> I can recreate the VMs on new disks.
>>>>>>>>>
>>>>>>>>> We currently have a set of stopped jenkins workers (named:
>>>>>>>>> apache-beam-jenkins-##) and running workers (named:
>>>>>>>>> apache-ci-beam-jenkins-##)
>>>>>>>>>
>>>>>>>>> Are there any concerns about deleting the stopped group of workers?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Oct 18, 2021 at 11:19 AM Ahmet Altay <al...@google.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thank you Daniel, Valentyn!
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 18, 2021 at 8:02 AM Daniel Oliveira <
>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I performed a light update of both Go and Python (from
>>>>>>>>>>> Valentyn's update) on each worker VM over the weekend. I also added
>>>>>>>>>>> additional instructions for the light update to Confluence (as an
>>>>>>>>>>> alternative to the current instructions).
>>>>>>>>>>>
>>>>>>>>>>> There is still reason to perform a full update at some point:
>>>>>>>>>>> Valentyn updated the VM image from 500 GB to 1000 GB of storage, which
>>>>>>>>>>> requires a full update to actually take effect.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Oct 12, 2021 at 10:32 AM Valentyn Tymofieiev <
>>>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> > 3. SSH into the agent and perform the update.
>>>>>>>>>>>> So, this would be a 'lite' version of the update, where we make
>>>>>>>>>>>> changes to the live worker without recreating worker VM with a new image?
>>>>>>>>>>>> We could perhaps document both options, and also make it clear that
>>>>>>>>>>>> producing a VM image that has necessary updates is mandatory even if we
>>>>>>>>>>>> perform 'lite' updates without recreating the worker.
>>>>>>>>>>>> Also, for a lite update, marking the Jenkins offer offline may
>>>>>>>>>>>> be optional, as some updates might not be disruptive (such as installing
>>>>>>>>>>>> some software that will not be used immediately).
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 11, 2021 at 7:53 PM Robert Burke <
>>>>>>>>>>>> robert@frantil.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> SGTM. Thank you very much Daniel!
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Oct 11, 2021, 7:51 PM Ahmet Altay <al...@google.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you Daniel. Could you please update the wiki once you
>>>>>>>>>>>>>> are done with the process?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Oct 11, 2021 at 6:22 PM Daniel Oliveira <
>>>>>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Took me a bit to get to this, sorry. I finally figured out
>>>>>>>>>>>>>>> an approach for updating Go and did so and will be updating the image
>>>>>>>>>>>>>>> momentarily.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think a more important note is that I tried what Valentyn
>>>>>>>>>>>>>>> was considering, which is SSHing into workers and updating the dependency.
>>>>>>>>>>>>>>> I'll describe the process below, but the summary is that I did it on one
>>>>>>>>>>>>>>> worker with Go so far, saw no problems over the weekend, and would like to
>>>>>>>>>>>>>>> continue updating the rest of the workers if there are no objections.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here's a step-by-step of what I did. If we decide to stick
>>>>>>>>>>>>>>> with this approach, these instructions can be added to Confluence:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. Go to the page for the Jenkins agent you want to update
>>>>>>>>>>>>>>> [1] and click "Mark this node temporarily offline", leaving a reason such
>>>>>>>>>>>>>>> as "Updating X dependency."
>>>>>>>>>>>>>>> 2. Wait until there are no more tests running in that agent
>>>>>>>>>>>>>>> (under "Build Executor Status" on the left of the page).
>>>>>>>>>>>>>>> 3. SSH into the agent and perform the update.
>>>>>>>>>>>>>>> 4. Mark the node as online again.
>>>>>>>>>>>>>>> 5. Repeat for every worker.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> And these are some additional steps if you want to
>>>>>>>>>>>>>>> immediately run a test suite to check that the update worked correctly. For
>>>>>>>>>>>>>>> example in my case, I wanted to check against the Go Postcommit, and it was
>>>>>>>>>>>>>>> a good thing I did, because it actually failed the first time and I had to
>>>>>>>>>>>>>>> go back in to fix a small oversight I made. So doing this after you update
>>>>>>>>>>>>>>> your first worker is probably a good idea before updating the rest:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. Go to the page for the job you want to run (for example:
>>>>>>>>>>>>>>> [2]).
>>>>>>>>>>>>>>> 2. Click "Configure" on the left menu.
>>>>>>>>>>>>>>> 3. Find the checkmark "Restrict where this project can be
>>>>>>>>>>>>>>> run" and change the restriction from "beam" to the specific name of the
>>>>>>>>>>>>>>> agent (ex. "apache-beam-jenkins-1").
>>>>>>>>>>>>>>> 4. Save and apply that change.
>>>>>>>>>>>>>>> 5. Back on the page for the job, click "Build with
>>>>>>>>>>>>>>> Parameters" on the left menu.
>>>>>>>>>>>>>>> 6. Run the build on "master".
>>>>>>>>>>>>>>> 7. Once you're done checking the results, change
>>>>>>>>>>>>>>> the restriction for the job back to "beam". (This also gets reset once
>>>>>>>>>>>>>>> every 24 hours in case you forget.)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I did that on one agent (apache-beam-jenkins-2) on Friday
>>>>>>>>>>>>>>> evening when it wasn't too busy, and got Go updated and working. I checked
>>>>>>>>>>>>>>> that agent's execution history again today just in case, and it was healthy
>>>>>>>>>>>>>>> over the weekend, with no Go-related problems as far as I could see. If
>>>>>>>>>>>>>>> there's no objections I'd like to go ahead and continue updating the rest
>>>>>>>>>>>>>>> of the workers (I'll do this late at night or over the weekend to avoid
>>>>>>>>>>>>>>> disrupting dev work).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>> https://ci-beam.apache.org/computer/apache-beam-jenkins-1/
>>>>>>>>>>>>>>> [2] https://ci-beam.apache.org/job/beam_PostCommit_Go/
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Oct 4, 2021 at 6:14 PM Valentyn Tymofieiev <
>>>>>>>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I updated the image in [1], but did not change the workers
>>>>>>>>>>>>>>>> yet to pick up the new image. We can do this once we add Go changes on
>>>>>>>>>>>>>>>> top of it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am also considering to SSH into every worker and run a
>>>>>>>>>>>>>>>> one-line command that adds the dependency that was missing. It seems to be
>>>>>>>>>>>>>>>> low risk, and  there is a fall-back plan to re-start the worker using the
>>>>>>>>>>>>>>>> saved image - both new and old images are saved and available in Cloud
>>>>>>>>>>>>>>>> Console.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Ideally, we should find a way to do a rolling upgrade that
>>>>>>>>>>>>>>>> a PMC or committer could trigger without logging into every machine.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424228#comment-17424228
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 3:28 PM Daniel Oliveira <
>>>>>>>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> @Brian Hulette <bh...@google.com> That button seems
>>>>>>>>>>>>>>>>> like exactly what we'd need. Doing it manually would be a pain, but it's
>>>>>>>>>>>>>>>>> probably still preferable to causing a bunch of aborted tests.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> @Valentyn Tymofieiev <va...@google.com> Collaborating
>>>>>>>>>>>>>>>>> to do both updates at once is a great idea! I'll message you directly about
>>>>>>>>>>>>>>>>> it.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:44 PM Valentyn Tymofieiev <
>>>>>>>>>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I am also interested in this updating version of Python
>>>>>>>>>>>>>>>>>> on VMs, I need to install Python 3.9. Thanks for looking into this.  We can
>>>>>>>>>>>>>>>>>> coordinate together to make one update instead of two.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:40 PM Brian Hulette <
>>>>>>>>>>>>>>>>>> bhulette@google.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'm not sure about best practices here. Out of curiosity
>>>>>>>>>>>>>>>>>>> I just poked around in the Jenkins UI (e.g. [1]) and it looks like you can
>>>>>>>>>>>>>>>>>>> manually "Mark node temporarily offline" when logged in (if you're a
>>>>>>>>>>>>>>>>>>> committer). According to [2] this will prevent it from picking up new jobs
>>>>>>>>>>>>>>>>>>> after it's finished the currently executing ones. Doing that manually for
>>>>>>>>>>>>>>>>>>> every worker could be a pain though.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>> https://ci-beam.apache.org/computer/apache-beam-jenkins-13/
>>>>>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>>>>>> https://stackoverflow.com/questions/26553612/how-do-i-disable-a-node-in-jenkins-ui-after-it-has-completed-its-currently-runni
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 1:03 PM Daniel Oliveira <
>>>>>>>>>>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I'm aiming at upgrading the version of Go on our
>>>>>>>>>>>>>>>>>>>> Jenkins VMs, and I found these instructions on
>>>>>>>>>>>>>>>>>>>> upgrading software on Jenkins
>>>>>>>>>>>>>>>>>>>> <https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers> on
>>>>>>>>>>>>>>>>>>>> our cwiki.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I haven't started going through it yet, but I was
>>>>>>>>>>>>>>>>>>>> wondering about the last few steps that involve stopping VMs, deleting boot
>>>>>>>>>>>>>>>>>>>> disks, and restarting executors. Is there some best practice for
>>>>>>>>>>>>>>>>>>>> that section to avoid causing interruptions in our automated testing?
>>>>>>>>>>>>>>>>>>>> Should I be trying to do this outside of peak dev hours, or going one VM at
>>>>>>>>>>>>>>>>>>>> a time so others can pick up extra load, or anything like that?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Daniel Oliveira
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>

Re: Best practices for upgrading installed dependencies on Jenkins VMs?

Posted by Robert Burke <ro...@frantil.com>.
It looks like .profile was edited to add a PATH entry pointing to
/snap/bin (where go is now installed), but .profile doesn't seem to be
executed by the Jenkins login shells.

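A quick way to check this without waiting for a full Jenkins run, as a
sketch only: the helper below (resolves_in_path is a made-up name) reports
whether a command can be found under a given PATH value. The idea is to
paste in the PATH the agent actually sees, for example copied from a job's
environment dump, and compare it against the PATH of an SSH login.

```shell
# Hypothetical helper: succeed if "$1" can be found when PATH is set to "$2".
# Note that "command -v" also reports shell builtins and functions, so this
# is only meaningful for external commands such as go.
resolves_in_path() {
  cmd="$1"
  path_value="$2"
  # Use a subshell so the caller's own PATH is left untouched.
  (
    PATH="$path_value"
    command -v "$cmd" >/dev/null 2>&1
  )
}
```

For example:
resolves_in_path go "/usr/local/bin:/usr/bin:/bin" || echo "go not visible"
On a worker where go is only installed under /snap/bin, I would expect that
call to report go as not visible until /snap/bin is part of the PATH value.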


On Fri, Oct 29, 2021, 6:23 PM Valentyn Tymofieiev <va...@google.com>
wrote:

>
>
> On Wed, Oct 20, 2021 at 11:16 AM Valentyn Tymofieiev <va...@google.com>
> wrote:
>
>>
>>
>> On Wed, Oct 20, 2021 at 11:12 AM Pablo Estrada <pa...@google.com>
>> wrote:
>>
>>> Thanks everyone for investigating and documenting this. I'll use it
>>> today : )
>>>
>> Dan may be also in the middle of doing this, please coordinate.
>>
>>>
>>> ahem - maybe we should rename the image name/image family names
>>> to jenkins-worker-boot-image ? Does anyone foresee issues if we do that?
>>> Does jenkins depend on these names in some undocumented way?
>>>
>> +1. it should 'just work', need to update the wiki after the change.
>> Jenkins also did a terminology adjustment.
>>
> I had to reimage Jenkins workers again, took care of the rename and
> changed the instructions.
>
> I am not sure what is the status of Go Postcommit problem, but noticed
> that jenkins worker #1 had a different boot disk. I reimaged all workers
> building on top of the latest image from the image family. If Go tests
> start failing, we may need to get help from Dan again.
>
>
>>
>>> On Tue, Oct 19, 2021 at 1:43 PM Daniel Oliveira <da...@google.com>
>>> wrote:
>>>
>>>> I'm ok with deciding to avoid the "lite" update option; feel free to
>>>> revise the instructions as seems appropriate. As for the issue, I fixed
>>>> it with a workaround that should work until we need to add a new image to
>>>> the agents, and I'm currently investigating the root cause and preparing a
>>>> fixed image.
>>>>
>>>> That said, I think this issue would have still happened even if we
>>>> didn't perform the "lite" update. I'm still trying to figure out the exact
>>>> problem, but it looks to be a PATH issue that wasn't effectively caught by
>>>> the current process. I won't get into details too much in this thread (see
>>>> the Jira for that), but essentially everything works in my environment when
>>>> I SSH into the VMs, but because the location of the "go" command changed in
>>>> the PATH, it seems to have stopped working for every other user, including
>>>> the Jenkins agents. I actually did notice that would happen when I was
>>>> working on the image, but the solution seemed to be to reboot the machine,
>>>> which I assumed happened already since I shut down the VM to image it.
>>>>
>>>> On Tue, Oct 19, 2021 at 12:09 PM Robert Burke <ro...@frantil.com>
>>>> wrote:
>>>>
>>>>> +1 to only having one way to do things. The Lite option seems liable
>>>>> to cause more problems since it means its changes can be blown away if a
>>>>> new image isn't prepared anyway.
>>>>> I don't think we are changing the images often enough for it.  Perhaps
>>>>> call it the option to test changes if anything?
>>>>>
>>>>> On Tue, Oct 19, 2021, 11:55 AM Valentyn Tymofieiev <
>>>>> valentyn@google.com> wrote:
>>>>>
>>>>>> All workers were updated to use jenkins-slave-boot-image-20211011,
>>>>>> which should have had a go command, but it appears slightly misconfigured.
>>>>>> I reopened BEAM-13037 [1] and added some details there.
>>>>>>
>>>>>> I also added instructions to wiki [2] on how to perform an image swap
>>>>>> and it is actually very straightforward. I think a lesson here is that
>>>>>> making 'lite' upgrades is brittle as misconfigurations could resurface down
>>>>>> the road when the context of the lite upgrade is no longer fresh in our
>>>>>> memory.
>>>>>>
>>>>>> I suggest we revise the instructions to keep only image swap commands
>>>>>> and remove the 'lite' update option. +Daniel Oliveira
>>>>>> <da...@google.com>, WDYT?  In the meantime, we should also
>>>>>> prepare an image that fixes the misconfiguration. Would you be able to help
>>>>>> with that? Thank you.
>>>>>>
>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-13037
>>>>>> [2]
>>>>>> https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 19, 2021 at 8:46 AM Robert Burke <ro...@frantil.com>
>>>>>> wrote:
>>>>>>
>>>>>>> FYI it looks like all the Go tests are now failing because it can't
>>>>>>> find the Go command at all.
>>>>>>> Did a Jenkins image without Go (v1.16+) pre-installed get pushed?
>>>>>>>
>>>>>>> On Mon, Oct 18, 2021, 1:45 PM Valentyn Tymofieiev <
>>>>>>> valentyn@google.com> wrote:
>>>>>>>
>>>>>>>> Thanks Daniel,
>>>>>>>>
>>>>>>>> I can recreate the VMs on new disks.
>>>>>>>>
>>>>>>>> We currently have a set of stopped jenkins workers (named:
>>>>>>>> apache-beam-jenkins-##) and running workers (named:
>>>>>>>> apache-ci-beam-jenkins-##)
>>>>>>>>
>>>>>>>> Are there any concerns about deleting the stopped group of workers?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Oct 18, 2021 at 11:19 AM Ahmet Altay <al...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thank you Daniel, Valentyn!
>>>>>>>>>
>>>>>>>>> On Mon, Oct 18, 2021 at 8:02 AM Daniel Oliveira <
>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> I performed a light update of both Go and Python (from Valentyn's
>>>>>>>>>> update) on each worker VM over the weekend. I also added additional
>>>>>>>>>> instructions for the light update to Confluence (as an alternative to the
>>>>>>>>>> current instructions).
>>>>>>>>>>
>>>>>>>>>> There is still reason to perform a full update at some point:
>>>>>>>>>> Valentyn updated the VM image from 500 GB to 1000 GB of storage, which
>>>>>>>>>> requires a full update to actually take effect.
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 12, 2021 at 10:32 AM Valentyn Tymofieiev <
>>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> > 3. SSH into the agent and perform the update.
>>>>>>>>>>> So, this would be a 'lite' version of the update, where we make
>>>>>>>>>>> changes to the live worker without recreating worker VM with a new image?
>>>>>>>>>>> We could perhaps document both options, and also make it clear that
>>>>>>>>>>> producing a VM image that has necessary updates is mandatory even if we
>>>>>>>>>>> perform 'lite' updates without recreating the worker.
>>>>>>>>>>> Also, for a lite update, marking the Jenkins worker offline may
>>>>>>>>>>> be optional, as some updates might not be disruptive (such as installing
>>>>>>>>>>> some software that will not be used immediately).
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 11, 2021 at 7:53 PM Robert Burke <ro...@frantil.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> SGTM. Thank you very much Daniel!
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 11, 2021, 7:51 PM Ahmet Altay <al...@google.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you Daniel. Could you please update the wiki once you
>>>>>>>>>>>>> are done with the process?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Oct 11, 2021 at 6:22 PM Daniel Oliveira <
>>>>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Took me a bit to get to this, sorry. I finally figured out an
>>>>>>>>>>>>>> approach for updating Go and did so and will be updating the image
>>>>>>>>>>>>>> momentarily.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think a more important note is that I tried what Valentyn
>>>>>>>>>>>>>> was considering, which is SSHing into workers and updating the dependency.
>>>>>>>>>>>>>> I'll describe the process below, but the summary is that I did it on one
>>>>>>>>>>>>>> worker with Go so far, saw no problems over the weekend, and would like to
>>>>>>>>>>>>>> continue updating the rest of the workers if there are no objections.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here's a step-by-step of what I did. If we decide to stick
>>>>>>>>>>>>>> with this approach, these instructions can be added to Confluence:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. Go to the page for the Jenkins agent you want to update
>>>>>>>>>>>>>> [1] and click "Mark this node temporarily offline", leaving a reason such
>>>>>>>>>>>>>> as "Updating X dependency."
>>>>>>>>>>>>>> 2. Wait until there are no more tests running in that agent
>>>>>>>>>>>>>> (under "Build Executor Status" on the left of the page).
>>>>>>>>>>>>>> 3. SSH into the agent and perform the update.
>>>>>>>>>>>>>> 4. Mark the node as online again.
>>>>>>>>>>>>>> 5. Repeat for every worker.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And these are some additional steps if you want to
>>>>>>>>>>>>>> immediately run a test suite to check that the update worked correctly. For
>>>>>>>>>>>>>> example in my case, I wanted to check against the Go Postcommit, and it was
>>>>>>>>>>>>>> a good thing I did, because it actually failed the first time and I had to
>>>>>>>>>>>>>> go back in to fix a small oversight I made. So doing this after you update
>>>>>>>>>>>>>> your first worker is probably a good idea before updating the rest:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. Go to the page for the job you want to run (for example:
>>>>>>>>>>>>>> [2]).
>>>>>>>>>>>>>> 2. Click "Configure" on the left menu.
>>>>>>>>>>>>>> 3. Find the checkmark "Restrict where this project can be
>>>>>>>>>>>>>> run" and change the restriction from "beam" to the specific name of the
>>>>>>>>>>>>>> agent (ex. "apache-beam-jenkins-1").
>>>>>>>>>>>>>> 4. Save and apply that change.
>>>>>>>>>>>>>> 5. Back on the page for the job, click "Build with
>>>>>>>>>>>>>> Parameters" on the left menu.
>>>>>>>>>>>>>> 6. Run the build on "master".
>>>>>>>>>>>>>> 7. Once you're done checking the results, change
>>>>>>>>>>>>>> the restriction for the job back to "beam". (This also gets reset once
>>>>>>>>>>>>>> every 24 hours in case you forget.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I did that on one agent (apache-beam-jenkins-2) on Friday
>>>>>>>>>>>>>> evening when it wasn't too busy, and got Go updated and working. I checked
>>>>>>>>>>>>>> that agent's execution history again today just in case, and it was healthy
>>>>>>>>>>>>>> over the weekend, with no Go-related problems as far as I could see. If
>>>>>>>>>>>>>> there are no objections, I'd like to go ahead and continue updating the rest
>>>>>>>>>>>>>> of the workers (I'll do this late at night or over the weekend to avoid
>>>>>>>>>>>>>> disrupting dev work).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>> https://ci-beam.apache.org/computer/apache-beam-jenkins-1/
>>>>>>>>>>>>>> [2] https://ci-beam.apache.org/job/beam_PostCommit_Go/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Oct 4, 2021 at 6:14 PM Valentyn Tymofieiev <
>>>>>>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I updated the image in [1], but did not change the workers
>>>>>>>>>>>>>>> to pick up the new image yet. We can do this once we add Go changes on
>>>>>>>>>>>>>>> top of it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am also considering SSHing into every worker and running a
>>>>>>>>>>>>>>> one-line command that adds the dependency that was missing. It seems to be
>>>>>>>>>>>>>>> low risk, and there is a fallback plan to restart the worker using the
>>>>>>>>>>>>>>> saved image - both new and old images are saved and available in Cloud
>>>>>>>>>>>>>>> Console.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ideally, we should find a way to do a rolling upgrade that a
>>>>>>>>>>>>>>> PMC or committer could trigger without logging into every machine.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424228#comment-17424228
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 3:28 PM Daniel Oliveira <
>>>>>>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @Brian Hulette <bh...@google.com> That button seems
>>>>>>>>>>>>>>>> like exactly what we'd need. Doing it manually would be a pain, but it's
>>>>>>>>>>>>>>>> probably still preferable to causing a bunch of aborted tests.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @Valentyn Tymofieiev <va...@google.com> Collaborating
>>>>>>>>>>>>>>>> to do both updates at once is a great idea! I'll message you directly about
>>>>>>>>>>>>>>>> it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:44 PM Valentyn Tymofieiev <
>>>>>>>>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am also interested in updating the version of Python on
>>>>>>>>>>>>>>>>> the VMs; I need to install Python 3.9. Thanks for looking into this. We can
>>>>>>>>>>>>>>>>> coordinate to make one update instead of two.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:40 PM Brian Hulette <
>>>>>>>>>>>>>>>>> bhulette@google.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'm not sure about best practices here. Out of curiosity
>>>>>>>>>>>>>>>>>> I just poked around in the Jenkins UI (e.g. [1]) and it looks like you can
>>>>>>>>>>>>>>>>>> manually "Mark node temporarily offline" when logged in (if you're a
>>>>>>>>>>>>>>>>>> committer). According to [2] this will prevent it from picking up new jobs
>>>>>>>>>>>>>>>>>> after it's finished the currently executing ones. Doing that manually for
>>>>>>>>>>>>>>>>>> every worker could be a pain though.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>> https://ci-beam.apache.org/computer/apache-beam-jenkins-13/
>>>>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>>>>> https://stackoverflow.com/questions/26553612/how-do-i-disable-a-node-in-jenkins-ui-after-it-has-completed-its-currently-runni
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 1:03 PM Daniel Oliveira <
>>>>>>>>>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'm aiming at upgrading the version of Go on our Jenkins
>>>>>>>>>>>>>>>>>>> VMs, and I found these instructions on upgrading
>>>>>>>>>>>>>>>>>>> software on Jenkins
>>>>>>>>>>>>>>>>>>> <https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers> on
>>>>>>>>>>>>>>>>>>> our cwiki.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I haven't started going through it yet, but I was
>>>>>>>>>>>>>>>>>>> wondering about the last few steps that involve stopping VMs, deleting boot
>>>>>>>>>>>>>>>>>>> disks, and restarting executors. Is there some best practice for
>>>>>>>>>>>>>>>>>>> that section to avoid causing interruptions in our automated testing?
>>>>>>>>>>>>>>>>>>> Should I be trying to do this outside of peak dev hours, or going one VM at
>>>>>>>>>>>>>>>>>>> a time so others can pick up extra load, or anything like that?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Daniel Oliveira
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>

Re: Best practices for upgrading installed dependencies on Jenkins VMs?

Posted by Valentyn Tymofieiev <va...@google.com>.
On Wed, Oct 20, 2021 at 11:16 AM Valentyn Tymofieiev <va...@google.com>
wrote:

>
>
> On Wed, Oct 20, 2021 at 11:12 AM Pablo Estrada <pa...@google.com> wrote:
>
>> Thanks everyone for investigating and documenting this. I'll use it today
>> : )
>>
> Dan may also be in the middle of doing this; please coordinate.
>
>>
>> ahem - maybe we should rename the image name/image family names
>> to jenkins-worker-boot-image ? Does anyone foresee issues if we do that?
>> Does jenkins depend on these names in some undocumented way?
>>
> +1. it should 'just work', need to update the wiki after the change.
> Jenkins also did a terminology adjustment.
>
I had to reimage Jenkins workers again, took care of the rename and changed
the instructions.

I am not sure what the status of the Go Postcommit problem is, but I noticed
that Jenkins worker #1 had a different boot disk. I reimaged all workers
building on top of the latest image from the image family. If Go tests
start failing, we may need to get help from Dan again.


>
>> On Tue, Oct 19, 2021 at 1:43 PM Daniel Oliveira <da...@google.com>
>> wrote:
>>
>>> I'm ok with deciding to avoid the "lite" update option; feel free to
>>> revise the instructions as seems appropriate. As for the issue, I fixed
>>> it with a workaround that should work until we need to add a new image to
>>> the agents, and I'm currently investigating the root cause and preparing a
>>> fixed image.
>>>
>>> That said, I think this issue would have still happened even if we
>>> didn't perform the "lite" update. I'm still trying to figure out the exact
>>> problem, but it looks to be a PATH issue that wasn't effectively caught by
>>> the current process. I won't get into details too much in this thread (see
>>> the Jira for that), but essentially everything works in my environment when
>>> I SSH into the VMs, but because the location of the "go" command changed in
>>> the PATH, it seems to have stopped working for every other user, including
>>> the Jenkins agents. I actually did notice that would happen when I was
>>> working on the image, but the solution seemed to be to reboot the machine,
>>> which I assumed happened already since I shut down the VM to image it.
>>>
>>> On Tue, Oct 19, 2021 at 12:09 PM Robert Burke <ro...@frantil.com>
>>> wrote:
>>>
>>>> +1 to only having one way to do things. The Lite option seems liable to
>>>> cause more problems since it means its changes can be blown away if a new
>>>> image isn't prepared anyway.
>>>> I don't think we are changing the images often enough for it.  Perhaps
>>>> call it the option to test changes if anything?
>>>>
>>>> On Tue, Oct 19, 2021, 11:55 AM Valentyn Tymofieiev <va...@google.com>
>>>> wrote:
>>>>
>>>>> All workers were updated to use jenkins-slave-boot-image-20211011,
>>>>> which should have had a go command, but it appears slightly misconfigured.
>>>>> I reopened BEAM-13037 [1] and added some details there.
>>>>>
>>>>> I also added instructions to wiki [2] on how to perform an image swap
>>>>> and it is actually very straightforward. I think a lesson here is that
>>>>> making 'lite' upgrades is brittle as misconfigurations could resurface down
>>>>> the road when the context of the lite upgrade is no longer fresh in our
>>>>> memory.
>>>>>
>>>>> I suggest we revise the instructions to keep only image swap commands
>>>>> and remove the 'lite' update option. +Daniel Oliveira
>>>>> <da...@google.com>, WDYT?  In the meantime, we should also
>>>>> prepare an image that fixes the misconfiguration. Would you be able to help
>>>>> with that? Thank you.
>>>>>
>>>>> [1] https://issues.apache.org/jira/browse/BEAM-13037
>>>>> [2]
>>>>> https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers
>>>>>
>>>>>
>>>>> On Tue, Oct 19, 2021 at 8:46 AM Robert Burke <ro...@frantil.com>
>>>>> wrote:
>>>>>
>>>>>> FYI it looks like all the Go tests are now failing because it can't
>>>>>> find the Go command at all.
>>>>>> Did a Jenkins image without Go (v1.16+) pre-installed get pushed?
>>>>>>
>>>>>> On Mon, Oct 18, 2021, 1:45 PM Valentyn Tymofieiev <
>>>>>> valentyn@google.com> wrote:
>>>>>>
>>>>>>> Thanks Daniel,
>>>>>>>
>>>>>>> I can recreate the VMs on new disks.
>>>>>>>
>>>>>>> We currently have a set of stopped jenkins workers (named:
>>>>>>> apache-beam-jenkins-##) and running workers (named:
>>>>>>> apache-ci-beam-jenkins-##)
>>>>>>>
>>>>>>> Are there any concerns about deleting the stopped group of workers?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Oct 18, 2021 at 11:19 AM Ahmet Altay <al...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thank you Daniel, Valentyn!
>>>>>>>>
>>>>>>>> On Mon, Oct 18, 2021 at 8:02 AM Daniel Oliveira <
>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>
>>>>>>>>> I performed a light update of both Go and Python (from Valentyn's
>>>>>>>>> update) on each worker VM over the weekend. I also added additional
>>>>>>>>> instructions for the light update to Confluence (as an alternative to the
>>>>>>>>> current instructions).
>>>>>>>>>
>>>>>>>>> There is still reason to perform a full update at some point:
>>>>>>>>> Valentyn updated the VM image from 500 GB to 1000 GB of storage, which
>>>>>>>>> requires a full update to actually take effect.
>>>>>>>>>
>>>>>>>>> On Tue, Oct 12, 2021 at 10:32 AM Valentyn Tymofieiev <
>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> > 3. SSH into the agent and perform the update.
>>>>>>>>>> So, this would be a 'lite' version of the update, where we make
>>>>>>>>>> changes to the live worker without recreating worker VM with a new image?
>>>>>>>>>> We could perhaps document both options, and also make it clear that
>>>>>>>>>> producing a VM image that has necessary updates is mandatory even if we
>>>>>>>>>> perform 'lite' updates without recreating the worker.
>>>>>>>>>> Also, for a lite update, marking the Jenkins worker offline may be
>>>>>>>>>> optional, as some updates might not be disruptive (such as installing some
>>>>>>>>>> software that will not be used immediately).
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 11, 2021 at 7:53 PM Robert Burke <ro...@frantil.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> SGTM. Thank you very much Daniel!
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 11, 2021, 7:51 PM Ahmet Altay <al...@google.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thank you Daniel. Could you please update the wiki once you are
>>>>>>>>>>>> done with the process?
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 11, 2021 at 6:22 PM Daniel Oliveira <
>>>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Took me a bit to get to this, sorry. I finally figured out an
>>>>>>>>>>>>> approach for updating Go and did so and will be updating the image
>>>>>>>>>>>>> momentarily.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think a more important note is that I tried what Valentyn
>>>>>>>>>>>>> was considering, which is SSHing into workers and updating the dependency.
>>>>>>>>>>>>> I'll describe the process below, but the summary is that I did it on one
>>>>>>>>>>>>> worker with Go so far, saw no problems over the weekend, and would like to
>>>>>>>>>>>>> continue updating the rest of the workers if there are no objections.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here's a step-by-step of what I did. If we decide to stick
>>>>>>>>>>>>> with this approach, these instructions can be added to Confluence:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. Go to the page for the Jenkins agent you want to update [1]
>>>>>>>>>>>>> and click "Mark this node temporarily offline", leaving a reason such as
>>>>>>>>>>>>> "Updating X dependency."
>>>>>>>>>>>>> 2. Wait until there are no more tests running in that agent
>>>>>>>>>>>>> (under "Build Executor Status" on the left of the page).
>>>>>>>>>>>>> 3. SSH into the agent and perform the update.
>>>>>>>>>>>>> 4. Mark the node as online again.
>>>>>>>>>>>>> 5. Repeat for every worker.
>>>>>>>>>>>>>
>>>>>>>>>>>>> And these are some additional steps if you want to immediately
>>>>>>>>>>>>> run a test suite to check that the update worked correctly. For example in
>>>>>>>>>>>>> my case, I wanted to check against the Go Postcommit, and it was a good
>>>>>>>>>>>>> thing I did, because it actually failed the first time and I had to go back
>>>>>>>>>>>>> in to fix a small oversight I made. So doing this after you update your
>>>>>>>>>>>>> first worker is probably a good idea before updating the rest:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. Go to the page for the job you want to run (for example:
>>>>>>>>>>>>> [2]).
>>>>>>>>>>>>> 2. Click "Configure" on the left menu.
>>>>>>>>>>>>> 3. Find the checkmark "Restrict where this project can be run"
>>>>>>>>>>>>> and change the restriction from "beam" to the specific name of the agent
>>>>>>>>>>>>> (ex. "apache-beam-jenkins-1").
>>>>>>>>>>>>> 4. Save and apply that change.
>>>>>>>>>>>>> 5. Back on the page for the job, click "Build with Parameters"
>>>>>>>>>>>>> on the left menu.
>>>>>>>>>>>>> 6. Run the build on "master".
>>>>>>>>>>>>> 7. Once you're done checking the results, change
>>>>>>>>>>>>> the restriction for the job back to "beam". (This also gets reset once
>>>>>>>>>>>>> every 24 hours in case you forget.)
>>>>>>>>>>>>>
>>>>>>>>>>>>> I did that on one agent (apache-beam-jenkins-2) on Friday
>>>>>>>>>>>>> evening when it wasn't too busy, and got Go updated and working. I checked
>>>>>>>>>>>>> that agent's execution history again today just in case, and it was healthy
>>>>>>>>>>>>> over the weekend, with no Go-related problems as far as I could see. If
>>>>>>>>>>>>> there are no objections, I'd like to go ahead and continue updating the rest
>>>>>>>>>>>>> of the workers (I'll do this late at night or over the weekend to avoid
>>>>>>>>>>>>> disrupting dev work).
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] https://ci-beam.apache.org/computer/apache-beam-jenkins-1/
>>>>>>>>>>>>> [2] https://ci-beam.apache.org/job/beam_PostCommit_Go/
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Oct 4, 2021 at 6:14 PM Valentyn Tymofieiev <
>>>>>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I updated the image in [1], but did not change the workers
>>>>>>>>>>>>>> to pick up the new image yet. We can do this once we add Go changes on
>>>>>>>>>>>>>> top of it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am also considering SSHing into every worker and running a
>>>>>>>>>>>>>> one-line command that adds the dependency that was missing. It seems to be
>>>>>>>>>>>>>> low risk, and there is a fallback plan to restart the worker using the
>>>>>>>>>>>>>> saved image - both new and old images are saved and available in Cloud
>>>>>>>>>>>>>> Console.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ideally, we should find a way to do a rolling upgrade that a
>>>>>>>>>>>>>> PMC or committer could trigger without logging into every machine.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424228#comment-17424228
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 3:28 PM Daniel Oliveira <
>>>>>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> @Brian Hulette <bh...@google.com> That button seems like
>>>>>>>>>>>>>>> exactly what we'd need. Doing it manually would be a pain, but it's
>>>>>>>>>>>>>>> probably still preferable to causing a bunch of aborted tests.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> @Valentyn Tymofieiev <va...@google.com> Collaborating to
>>>>>>>>>>>>>>> do both updates at once is a great idea! I'll message you directly about it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:44 PM Valentyn Tymofieiev <
>>>>>>>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am also interested in updating the version of Python on
>>>>>>>>>>>>>>>> the VMs; I need to install Python 3.9. Thanks for looking into this. We can
>>>>>>>>>>>>>>>> coordinate to make one update instead of two.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:40 PM Brian Hulette <
>>>>>>>>>>>>>>>> bhulette@google.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm not sure about best practices here. Out of curiosity I
>>>>>>>>>>>>>>>>> just poked around in the Jenkins UI (e.g. [1]) and it looks like you can
>>>>>>>>>>>>>>>>> manually "Mark node temporarily offline" when logged in (if you're a
>>>>>>>>>>>>>>>>> committer). According to [2] this will prevent it from picking up new jobs
>>>>>>>>>>>>>>>>> after it's finished the currently executing ones. Doing that manually for
>>>>>>>>>>>>>>>>> every worker could be a pain though.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>> https://ci-beam.apache.org/computer/apache-beam-jenkins-13/
>>>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>>>> https://stackoverflow.com/questions/26553612/how-do-i-disable-a-node-in-jenkins-ui-after-it-has-completed-its-currently-runni
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 1:03 PM Daniel Oliveira <
>>>>>>>>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'm aiming at upgrading the version of Go on our Jenkins
>>>>>>>>>>>>>>>>>> VMs, and I found these instructions on upgrading
>>>>>>>>>>>>>>>>>> software on Jenkins
>>>>>>>>>>>>>>>>>> <https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers> on
>>>>>>>>>>>>>>>>>> our cwiki.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I haven't started going through it yet, but I was
>>>>>>>>>>>>>>>>>> wondering about the last few steps that involve stopping VMs, deleting boot
>>>>>>>>>>>>>>>>>> disks, and restarting executors. Is there some best practice for
>>>>>>>>>>>>>>>>>> that section to avoid causing interruptions in our automated testing?
>>>>>>>>>>>>>>>>>> Should I be trying to do this outside of peak dev hours, or going one VM at
>>>>>>>>>>>>>>>>>> a time so others can pick up extra load, or anything like that?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Daniel Oliveira
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>

Re: Best practices for upgrading installed dependencies on Jenkins VMs?

Posted by Valentyn Tymofieiev <va...@google.com>.
On Wed, Oct 20, 2021 at 11:12 AM Pablo Estrada <pa...@google.com> wrote:

> Thanks everyone for investigating and documenting this. I'll use it today
> : )
>
Dan may also be in the middle of doing this; please coordinate.

>
> ahem - maybe we should rename the image name/image family names
> to jenkins-worker-boot-image ? Does anyone foresee issues if we do that?
> Does jenkins depend on these names in some undocumented way?
>
+1. it should 'just work', need to update the wiki after the change.
Jenkins also did a terminology adjustment.

>
> On Tue, Oct 19, 2021 at 1:43 PM Daniel Oliveira <da...@google.com>
> wrote:
>
>> I'm ok with deciding to avoid the "lite" update option; feel free to
>> revise the instructions as seems appropriate. As for the issue, I fixed
>> it with a workaround that should work until we need to add a new image to
>> the agents, and I'm currently investigating the root cause and preparing a
>> fixed image.
>>
>> That said, I think this issue would have still happened even if we didn't
>> perform the "lite" update. I'm still trying to figure out the exact
>> problem, but it looks to be a PATH issue that wasn't effectively caught by
>> the current process. I won't get into details too much in this thread (see
>> the Jira for that), but essentially everything works in my environment when
>> I SSH into the VMs, but because the location of the "go" command changed in
>> the PATH, it seems to have stopped working for every other user, including
>> the Jenkins agents. I actually did notice that would happen when I was
>> working on the image, but the solution seemed to be to reboot the machine,
>> which I assumed happened already since I shut down the VM to image it.
>>
>> On Tue, Oct 19, 2021 at 12:09 PM Robert Burke <ro...@frantil.com> wrote:
>>
>>> +1 to only having one way to do things. The Lite option seems liable to
>>> cause more problems since it means its changes can be blown away if a new
>>> image isn't prepared anyway.
>>> I don't think we are changing the images often enough for it.  Perhaps
>>> call it the option to test changes if anything?
>>>
>>> On Tue, Oct 19, 2021, 11:55 AM Valentyn Tymofieiev <va...@google.com>
>>> wrote:
>>>
>>>> All workers were updated to use jenkins-slave-boot-image-20211011,
>>>> which should have had a go command, but it appears slightly misconfigured.
>>>> I reopened BEAM-13037 [1] and added some details there.
>>>>
>>>> I also added instructions to wiki [2] on how to perform an image swap
>>>> and it is actually very straightforward. I think a lesson here is that
>>>> making 'lite' upgrades is brittle as misconfigurations could resurface down
>>>> the road when the context of the lite upgrade is no longer fresh in our
>>>> memory.
>>>>
>>>> I suggest we revise the instructions to keep only image swap commands
>>>> and remove the 'lite' update option. +Daniel Oliveira
>>>> <da...@google.com>, WDYT?  In the meantime, we should also
>>>> prepare an image that fixes the misconfiguration. Would you be able to help
>>>> with that? Thank you.
>>>>
>>>> [1] https://issues.apache.org/jira/browse/BEAM-13037
>>>> [2]
>>>> https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers
>>>>
>>>>
>>>> On Tue, Oct 19, 2021 at 8:46 AM Robert Burke <ro...@frantil.com>
>>>> wrote:
>>>>
>>>>> FYI it looks like all the Go tests are now failing because it can't
>>>>> find the Go command at all.
>>>>> Did a Jenkins image without Go (v1.16+) pre-installed get pushed?
>>>>>
>>>>> On Mon, Oct 18, 2021, 1:45 PM Valentyn Tymofieiev <va...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks Daniel,
>>>>>>
>>>>>> I can recreate the VMs on new disks.
>>>>>>
>>>>>> We currently have a set of stopped jenkins workers (named:
>>>>>> apache-beam-jenkins-##) and running workers (named:
>>>>>> apache-ci-beam-jenkins-##)
>>>>>>
>>>>>> Are there any concerns about deleting the stopped group of workers?
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Oct 18, 2021 at 11:19 AM Ahmet Altay <al...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thank you Daniel, Valentyn!
>>>>>>>
>>>>>>> On Mon, Oct 18, 2021 at 8:02 AM Daniel Oliveira <
>>>>>>> danoliveira@google.com> wrote:
>>>>>>>
>>>>>>>> I performed a light update of both Go and Python (from Valentyn's
>>>>>>>> update) on each worker VM over the weekend. I also added additional
>>>>>>>> instructions for the light update to Confluence (as an alternative to the
>>>>>>>> current instructions).
>>>>>>>>
>>>>>>>> There is still reason to perform a full update at some point:
>>>>>>>> Valentyn updated the VM image from 500 GB to 1000 GB of storage, which
>>>>>>>> requires a full update to actually take effect.
>>>>>>>>
>>>>>>>> On Tue, Oct 12, 2021 at 10:32 AM Valentyn Tymofieiev <
>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>
>>>>>>>>> > 3. SSH into the agent and perform the update.
>>>>>>>>> So, this would be a 'lite' version of the update, where we make
>>>>>>>>> changes to the live worker without recreating the worker VM with a new image?
>>>>>>>>> We could perhaps document both options, and also make it clear that
>>>>>>>>> producing a VM image that has the necessary updates is mandatory even if we
>>>>>>>>> perform 'lite' updates without recreating the worker.
>>>>>>>>> Also, for a lite update, marking the Jenkins worker offline may be
>>>>>>>>> optional, as some updates might not be disruptive (such as installing some
>>>>>>>>> software that will not be used immediately).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Oct 11, 2021 at 7:53 PM Robert Burke <ro...@frantil.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> SGTM. Thank you very much Daniel!
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 11, 2021, 7:51 PM Ahmet Altay <al...@google.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thank you Daniel. Could you please update the wiki once you are
>>>>>>>>>>> done with the process?
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 11, 2021 at 6:22 PM Daniel Oliveira <
>>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Took me a bit to get to this, sorry. I finally figured out an
>>>>>>>>>>>> approach for updating Go and did so and will be updating the image
>>>>>>>>>>>> momentarily.
>>>>>>>>>>>>
>>>>>>>>>>>> I think a more important note is that I tried what Valentyn was
>>>>>>>>>>>> considering, which is SSHing into workers and updating the dependency. I'll
>>>>>>>>>>>> describe the process below, but the summary is that I did it on one worker
>>>>>>>>>>>> with Go so far, saw no problems over the weekend, and would like to
>>>>>>>>>>>> continue updating the rest of the workers if there are no objections.
>>>>>>>>>>>>
>>>>>>>>>>>> Here's a step-by-step of what I did. If we decide to stick with
>>>>>>>>>>>> this approach, these instructions can be added to Confluence:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Go to the page for the Jenkins agent you want to update [1]
>>>>>>>>>>>> and click "Mark this node temporarily offline", leaving a reason such as
>>>>>>>>>>>> "Updating X dependency."
>>>>>>>>>>>> 2. Wait until there are no more tests running in that agent
>>>>>>>>>>>> (under "Build Executor Status" on the left of the page).
>>>>>>>>>>>> 3. SSH into the agent and perform the update.
>>>>>>>>>>>> 4. Mark the node as online again.
>>>>>>>>>>>> 5. Repeat for every worker.
>>>>>>>>>>>>
>>>>>>>>>>>> And these are some additional steps if you want to immediately
>>>>>>>>>>>> run a test suite to check that the update worked correctly. For example in
>>>>>>>>>>>> my case, I wanted to check against the Go Postcommit, and it was a good
>>>>>>>>>>>> thing I did, because it actually failed the first time and I had to go back
>>>>>>>>>>>> in to fix a small oversight I made. So doing this after you update your
>>>>>>>>>>>> first worker is probably a good idea before updating the rest:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Go to the page for the job you want to run (for example:
>>>>>>>>>>>> [2]).
>>>>>>>>>>>> 2. Click "Configure" on the left menu.
>>>>>>>>>>>> 3. Find the checkmark "Restrict where this project can be run"
>>>>>>>>>>>> and change the restriction from "beam" to the specific name of the agent
>>>>>>>>>>>> (ex. "apache-beam-jenkins-1").
>>>>>>>>>>>> 4. Save and apply that change.
>>>>>>>>>>>> 5. Back on the page for the job, click "Build with Parameters"
>>>>>>>>>>>> on the left menu.
>>>>>>>>>>>> 6. Run the build on "master".
>>>>>>>>>>>> 7. Once you're done checking the results, change
>>>>>>>>>>>> the restriction for the job back to "beam". (This also gets reset once
>>>>>>>>>>>> every 24 hours in case you forget.)
>>>>>>>>>>>>
>>>>>>>>>>>> I did that on one agent (apache-beam-jenkins-2) on Friday
>>>>>>>>>>>> evening when it wasn't too busy, and got Go updated and working. I checked
>>>>>>>>>>>> that agent's execution history again today just in case, and it was healthy
>>>>>>>>>>>> over the weekend, with no Go-related problems as far as I could see. If
>>>>>>>>>>>> there are no objections I'd like to go ahead and continue updating the rest
>>>>>>>>>>>> of the workers (I'll do this late at night or over the weekend to avoid
>>>>>>>>>>>> disrupting dev work).
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://ci-beam.apache.org/computer/apache-beam-jenkins-1/
>>>>>>>>>>>> [2] https://ci-beam.apache.org/job/beam_PostCommit_Go/
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 4, 2021 at 6:14 PM Valentyn Tymofieiev <
>>>>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I updated the image in [1], but did not change the workers to
>>>>>>>>>>>>> pick up the new image yet. We can do this once we add Go changes on top
>>>>>>>>>>>>> of it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am also considering SSHing into every worker and running a
>>>>>>>>>>>>> one-line command that adds the dependency that was missing. It seems to be
>>>>>>>>>>>>> low risk, and there is a fall-back plan to re-start the worker using the
>>>>>>>>>>>>> saved image - both new and old images are saved and available in Cloud
>>>>>>>>>>>>> Console.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ideally, we should find a way to do a rolling upgrade that a
>>>>>>>>>>>>> PMC or committer could trigger without logging into every machine.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424228#comment-17424228
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 3:28 PM Daniel Oliveira <
>>>>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> @Brian Hulette <bh...@google.com> That button seems like
>>>>>>>>>>>>>> exactly what we'd need. Doing it manually would be a pain, but it's
>>>>>>>>>>>>>> probably still preferable to causing a bunch of aborted tests.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> @Valentyn Tymofieiev <va...@google.com> Collaborating to
>>>>>>>>>>>>>> do both updates at once is a great idea! I'll message you directly about it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:44 PM Valentyn Tymofieiev <
>>>>>>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am also interested in updating the version of Python on the
>>>>>>>>>>>>>>> VMs; I need to install Python 3.9. Thanks for looking into this. We can
>>>>>>>>>>>>>>> coordinate to make one update instead of two.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:40 PM Brian Hulette <
>>>>>>>>>>>>>>> bhulette@google.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm not sure about best practices here. Out of curiosity I
>>>>>>>>>>>>>>>> just poked around in the Jenkins UI (e.g. [1]) and it looks like you can
>>>>>>>>>>>>>>>> manually "Mark node temporarily offline" when logged in (if you're a
>>>>>>>>>>>>>>>> committer). According to [2] this will prevent it from picking up new jobs
>>>>>>>>>>>>>>>> after it's finished the currently executing ones. Doing that manually for
>>>>>>>>>>>>>>>> every worker could be a pain though.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>> https://ci-beam.apache.org/computer/apache-beam-jenkins-13/
>>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>>> https://stackoverflow.com/questions/26553612/how-do-i-disable-a-node-in-jenkins-ui-after-it-has-completed-its-currently-runni
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 1:03 PM Daniel Oliveira <
>>>>>>>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm aiming at upgrading the version of Go on our Jenkins
>>>>>>>>>>>>>>>>> VMs, and I found these instructions on upgrading software
>>>>>>>>>>>>>>>>> on Jenkins
>>>>>>>>>>>>>>>>> <https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers> on
>>>>>>>>>>>>>>>>> our cwiki.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I haven't started going through it yet, but I was
>>>>>>>>>>>>>>>>> wondering about the last few steps that involve stopping VMs, deleting boot
>>>>>>>>>>>>>>>>> disks, and restarting executors. Is there some best practice for
>>>>>>>>>>>>>>>>> that section to avoid causing interruptions in our automated testing?
>>>>>>>>>>>>>>>>> Should I be trying to do this outside of peak dev hours, or going one VM at
>>>>>>>>>>>>>>>>> a time so others can pick up extra load, or anything like that?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Daniel Oliveira
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
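
For reference, the five-step 'lite' procedure quoted above (mark the node
offline, wait for running tests to finish, update, mark it online, repeat)
can be sketched as a loop. This is only an illustration under stated
assumptions: the function name and all five callables are hypothetical
placeholders, not Jenkins or Cloud APIs, and whoever runs it would have to
supply real implementations (or keep doing the steps by hand in the UI).

```python
from typing import Callable, Iterable, List


def rolling_update(
    agents: Iterable[str],
    mark_offline: Callable[[str, str], None],  # hypothetical: (agent, reason)
    is_idle: Callable[[str], bool],            # hypothetical: no tests running?
    apply_update: Callable[[str], None],       # hypothetical: e.g. SSH and install
    mark_online: Callable[[str], None],        # hypothetical
    wait: Callable[[], None],                  # hypothetical: sleep between polls
    reason: str = "Updating X dependency",
) -> List[str]:
    """Update agents one at a time and return the names that were updated.

    If apply_update raises, the loop stops and the failing agent is left
    offline, so a half-updated worker does not pick up new jobs.
    """
    updated: List[str] = []
    for agent in agents:
        mark_offline(agent, reason)   # step 1: stop new jobs landing here
        while not is_idle(agent):     # step 2: let running tests finish
            wait()
        apply_update(agent)           # step 3: perform the update
        mark_online(agent)            # step 4: bring the node back
        updated.append(agent)         # step 5: move on to the next worker
    return updated
```

Taking one agent at a time means the remaining workers keep picking up
jobs, which is the property the original question was asking about.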

Re: Best practices for upgrading installed dependencies on Jenkins VMs?

Posted by Pablo Estrada <pa...@google.com>.
Thanks everyone for investigating and documenting this. I'll use it today :)

Ahem, maybe we should rename the image and image family names to
jenkins-worker-boot-image? Does anyone foresee issues if we do that?
Does Jenkins depend on these names in some undocumented way?
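
One partial way to approach the rename question is to scan a local checkout
of whatever scripts and job definitions are under version control for the
old name. The sketch below is a plain text search; the function name and
the idea of which directory to scan are assumptions, and it can only find
references that are written down in files, not settings stored in Jenkins
or in the Cloud project.

```python
import os
from typing import List, Tuple


def find_references(root: str, needle: str) -> List[Tuple[str, int]]:
    """Return (relative path, line number) for each line under root that
    contains needle. Hypothetical helper, not part of any Beam tooling."""
    hits: List[Tuple[str, int]] = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            try:
                # errors="ignore" lets the scan pass over binary files.
                with open(full, encoding="utf-8", errors="ignore") as handle:
                    for number, line in enumerate(handle, start=1):
                        if needle in line:
                            hits.append((os.path.relpath(full, root), number))
            except OSError:
                continue  # unreadable file: skip rather than abort the scan
    return sorted(hits)
```

For example, find_references(".", "jenkins-slave-boot-image") would list
every checked-in line that mentions the current image name prefix.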

On Tue, Oct 19, 2021 at 1:43 PM Daniel Oliveira <da...@google.com>
wrote:

> I'm ok with deciding to avoid the "lite" update option, feel free to
> revise the instructions as seems appropriate. As for the issue, I fixed
> it with a workaround that should work until we need to add a new image to
> the agents, and I'm currently investigating the root cause and preparing a
> fixed image.
>
> That said, I think this issue would have still happened even if we didn't
> perform the "lite" update. I'm still trying to figure out the exact
> problem, but it looks to be a PATH issue that wasn't effectively caught by
> the current process. I won't get into details too much in this thread (see
> the Jira for that), but essentially everything works in my environment when
> I SSH into the VMs, but because the location of the "go" command changed in
> the PATH, it seems to have stopped working for every other user, including
> the Jenkins agents. I actually did notice that would happen when I was
> working on the image, but the solution seemed to be to reboot the machine,
> which I assumed happened already since I shut down the VM to image it.
>
> On Tue, Oct 19, 2021 at 12:09 PM Robert Burke <ro...@frantil.com> wrote:
>
>> +1 to only having one way to do things. The Lite option seems liable to
>> cause more problems since it means its changes can be blown away if a new
>> image isn't prepared anyway.
>> I don't think we are changing the images often enough for it.  Perhaps
>> call it the option to test changes if anything?
>>
>> On Tue, Oct 19, 2021, 11:55 AM Valentyn Tymofieiev <va...@google.com>
>> wrote:
>>
>>> All workers were updated to use jenkins-slave-boot-image-20211011, which
>>> should have had a go command, but it appears slightly misconfigured. I
>>> reopened BEAM-13037 [1] and added some details there.
>>>
>>> I also added instructions to the wiki [2] on how to perform an image swap
>>> and it is actually very straightforward. I think a lesson here is that
>>> making 'lite' upgrades is brittle as misconfigurations could resurface down
>>> the road when the context of the lite upgrade is no longer fresh in our
>>> memory.
>>>
>>> I suggest we revise the instructions to keep only image swap commands
>>> and remove the 'lite' update option. +Daniel Oliveira
>>> <da...@google.com>, WDYT?  In the meantime, we should also
>>> prepare an image that fixes the misconfiguration. Would you be able to help
>>> with that? Thank you.
>>>
>>> [1] https://issues.apache.org/jira/browse/BEAM-13037
>>> [2]
>>> https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers
>>>
>>>
>>> On Tue, Oct 19, 2021 at 8:46 AM Robert Burke <ro...@frantil.com> wrote:
>>>
>>>> FYI it looks like all the Go tests are now failing because it can't
>>>> find the Go command at all.
>>>> Did a Jenkins image without Go (v1.16+) pre-installed get pushed?
>>>>
>>>> On Mon, Oct 18, 2021, 1:45 PM Valentyn Tymofieiev <va...@google.com>
>>>> wrote:
>>>>
>>>>> Thanks Daniel,
>>>>>
>>>>> I can recreate the VMs on new disks.
>>>>>
>>>>> We currently have a set of stopped jenkins workers (named:
>>>>> apache-beam-jenkins-##) and running workers (named:
>>>>> apache-ci-beam-jenkins-##)
>>>>>
>>>>> Are there any concerns about deleting the stopped group of workers?
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Oct 18, 2021 at 11:19 AM Ahmet Altay <al...@google.com> wrote:
>>>>>
>>>>>> Thank you Daniel, Valentyn!
>>>>>>
>>>>>> On Mon, Oct 18, 2021 at 8:02 AM Daniel Oliveira <
>>>>>> danoliveira@google.com> wrote:
>>>>>>
>>>>>>> I performed a light update of both Go and Python (from Valentyn's
>>>>>>> update) on each worker VM over the weekend. I also added additional
>>>>>>> instructions for the light update to Confluence (as an alternative to the
>>>>>>> current instructions).
>>>>>>>
>>>>>>> There is still reason to perform a full update at some point:
>>>>>>> Valentyn updated the VM image from 500 GB to 1000 GB of storage, which
>>>>>>> requires a full update to actually take effect.
>>>>>>>
>>>>>>> On Tue, Oct 12, 2021 at 10:32 AM Valentyn Tymofieiev <
>>>>>>> valentyn@google.com> wrote:
>>>>>>>
>>>>>>>> > 3. SSH into the agent and perform the update.
>>>>>>>> So, this would be a 'lite' version of the update, where we make
>>>>>>>> changes to the live worker without recreating the worker VM with a new image?
>>>>>>>> We could perhaps document both options, and also make it clear that
>>>>>>>> producing a VM image that has the necessary updates is mandatory even if we
>>>>>>>> perform 'lite' updates without recreating the worker.
>>>>>>>> Also, for a lite update, marking the Jenkins worker offline may be
>>>>>>>> optional, as some updates might not be disruptive (such as installing some
>>>>>>>> software that will not be used immediately).
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Oct 11, 2021 at 7:53 PM Robert Burke <ro...@frantil.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> SGTM. Thank you very much Daniel!
>>>>>>>>>
>>>>>>>>> On Mon, Oct 11, 2021, 7:51 PM Ahmet Altay <al...@google.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thank you Daniel. Could you please update the wiki once you are
>>>>>>>>>> done with the process?
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 11, 2021 at 6:22 PM Daniel Oliveira <
>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Took me a bit to get to this, sorry. I finally figured out an
>>>>>>>>>>> approach for updating Go and did so and will be updating the image
>>>>>>>>>>> momentarily.
>>>>>>>>>>>
>>>>>>>>>>> I think a more important note is that I tried what Valentyn was
>>>>>>>>>>> considering, which is SSHing into workers and updating the dependency. I'll
>>>>>>>>>>> describe the process below, but the summary is that I did it on one worker
>>>>>>>>>>> with Go so far, saw no problems over the weekend, and would like to
>>>>>>>>>>> continue updating the rest of the workers if there are no objections.
>>>>>>>>>>>
>>>>>>>>>>> Here's a step-by-step of what I did. If we decide to stick with
>>>>>>>>>>> this approach, these instructions can be added to Confluence:
>>>>>>>>>>>
>>>>>>>>>>> 1. Go to the page for the Jenkins agent you want to update [1]
>>>>>>>>>>> and click "Mark this node temporarily offline", leaving a reason such as
>>>>>>>>>>> "Updating X dependency."
>>>>>>>>>>> 2. Wait until there are no more tests running in that agent
>>>>>>>>>>> (under "Build Executor Status" on the left of the page).
>>>>>>>>>>> 3. SSH into the agent and perform the update.
>>>>>>>>>>> 4. Mark the node as online again.
>>>>>>>>>>> 5. Repeat for every worker.
>>>>>>>>>>>
>>>>>>>>>>> And these are some additional steps if you want to immediately
>>>>>>>>>>> run a test suite to check that the update worked correctly. For example in
>>>>>>>>>>> my case, I wanted to check against the Go Postcommit, and it was a good
>>>>>>>>>>> thing I did, because it actually failed the first time and I had to go back
>>>>>>>>>>> in to fix a small oversight I made. So doing this after you update your
>>>>>>>>>>> first worker is probably a good idea before updating the rest:
>>>>>>>>>>>
>>>>>>>>>>> 1. Go to the page for the job you want to run (for example: [2]).
>>>>>>>>>>> 2. Click "Configure" on the left menu.
>>>>>>>>>>> 3. Find the checkmark "Restrict where this project can be run"
>>>>>>>>>>> and change the restriction from "beam" to the specific name of the agent
>>>>>>>>>>> (ex. "apache-beam-jenkins-1").
>>>>>>>>>>> 4. Save and apply that change.
>>>>>>>>>>> 5. Back on the page for the job, click "Build with Parameters"
>>>>>>>>>>> on the left menu.
>>>>>>>>>>> 6. Run the build on "master".
>>>>>>>>>>> 7. Once you're done checking the results, change the restriction
>>>>>>>>>>> for the job back to "beam". (This also gets reset once every 24 hours in
>>>>>>>>>>> case you forget.)
>>>>>>>>>>>
>>>>>>>>>>> I did that on one agent (apache-beam-jenkins-2) on Friday
>>>>>>>>>>> evening when it wasn't too busy, and got Go updated and working. I checked
>>>>>>>>>>> that agent's execution history again today just in case, and it was healthy
>>>>>>>>>>> over the weekend, with no Go-related problems as far as I could see. If
>>>>>>>>>>> there are no objections I'd like to go ahead and continue updating the rest
>>>>>>>>>>> of the workers (I'll do this late at night or over the weekend to avoid
>>>>>>>>>>> disrupting dev work).
>>>>>>>>>>>
>>>>>>>>>>> [1] https://ci-beam.apache.org/computer/apache-beam-jenkins-1/
>>>>>>>>>>> [2] https://ci-beam.apache.org/job/beam_PostCommit_Go/
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 4, 2021 at 6:14 PM Valentyn Tymofieiev <
>>>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I updated the image in [1], but did not change the workers to
>>>>>>>>>>>> pick up the new image yet. We can do this once we add Go changes on top
>>>>>>>>>>>> of it.
>>>>>>>>>>>>
>>>>>>>>>>>> I am also considering SSHing into every worker and running a
>>>>>>>>>>>> one-line command that adds the dependency that was missing. It seems to be
>>>>>>>>>>>> low risk, and there is a fall-back plan to re-start the worker using the
>>>>>>>>>>>> saved image - both new and old images are saved and available in Cloud
>>>>>>>>>>>> Console.
>>>>>>>>>>>>
>>>>>>>>>>>> Ideally, we should find a way to do a rolling upgrade that a
>>>>>>>>>>>> PMC or committer could trigger without logging into every machine.
>>>>>>>>>>>>
>>>>>>>>>>>> [1]
>>>>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424228#comment-17424228
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Sep 22, 2021 at 3:28 PM Daniel Oliveira <
>>>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> @Brian Hulette <bh...@google.com> That button seems like
>>>>>>>>>>>>> exactly what we'd need. Doing it manually would be a pain, but it's
>>>>>>>>>>>>> probably still preferable to causing a bunch of aborted tests.
>>>>>>>>>>>>>
>>>>>>>>>>>>> @Valentyn Tymofieiev <va...@google.com> Collaborating to
>>>>>>>>>>>>> do both updates at once is a great idea! I'll message you directly about it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:44 PM Valentyn Tymofieiev <
>>>>>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am also interested in updating the version of Python on the
>>>>>>>>>>>>>> VMs; I need to install Python 3.9. Thanks for looking into this. We can
>>>>>>>>>>>>>> coordinate to make one update instead of two.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:40 PM Brian Hulette <
>>>>>>>>>>>>>> bhulette@google.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm not sure about best practices here. Out of curiosity I
>>>>>>>>>>>>>>> just poked around in the Jenkins UI (e.g. [1]) and it looks like you can
>>>>>>>>>>>>>>> manually "Mark node temporarily offline" when logged in (if you're a
>>>>>>>>>>>>>>> committer). According to [2] this will prevent it from picking up new jobs
>>>>>>>>>>>>>>> after it's finished the currently executing ones. Doing that manually for
>>>>>>>>>>>>>>> every worker could be a pain though.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>> https://ci-beam.apache.org/computer/apache-beam-jenkins-13/
>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>> https://stackoverflow.com/questions/26553612/how-do-i-disable-a-node-in-jenkins-ui-after-it-has-completed-its-currently-runni
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 1:03 PM Daniel Oliveira <
>>>>>>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm aiming at upgrading the version of Go on our Jenkins
>>>>>>>>>>>>>>>> VMs, and I found these instructions on upgrading software
>>>>>>>>>>>>>>>> on Jenkins
>>>>>>>>>>>>>>>> <https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers> on
>>>>>>>>>>>>>>>> our cwiki.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I haven't started going through it yet, but I was wondering
>>>>>>>>>>>>>>>> about the last few steps that involve stopping VMs, deleting boot disks,
>>>>>>>>>>>>>>>> and restarting executors. Is there some best practice for that section to
>>>>>>>>>>>>>>>> avoid causing interruptions in our automated testing? Should I be trying to
>>>>>>>>>>>>>>>> do this outside of peak dev hours, or going one VM at a time so others can
>>>>>>>>>>>>>>>> pick up extra load, or anything like that?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Daniel Oliveira
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>

Re: Best practices for upgrading installed dependencies on Jenkins VMs?

Posted by Daniel Oliveira <da...@google.com>.
I'm ok with deciding to avoid the "lite" update option, feel free to revise
the instructions as seems appropriate. As for the issue, I fixed it with
a workaround that should work until we need to add a new image to the
agents, and I'm currently investigating the root cause and preparing a fixed
image.

That said, I think this issue would have still happened even if we didn't
perform the "lite" update. I'm still trying to figure out the exact
problem, but it looks to be a PATH issue that wasn't effectively caught by
the current process. I won't get into details too much in this thread (see
the Jira for that), but essentially everything works in my environment when
I SSH into the VMs, but because the location of the "go" command changed in
the PATH, it seems to have stopped working for every other user, including
the Jenkins agents. I actually did notice that would happen when I was
working on the image, but the solution seemed to be to reboot the machine,
which I assumed happened already since I shut down the VM to image it.
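
To illustrate the symptom (not the root cause, which is still being
investigated in the Jira), here is a minimal sketch of checking whether a
command resolves under each user's PATH. It uses Python's standard
shutil.which with an explicit path argument. The function names, the user
labels, and any PATH strings passed in are hypothetical; the real values
would have to be read from each user's environment on the VM.

```python
import shutil
from typing import Dict, List, Optional


def resolve_for_each(
    command: str, paths_by_user: Dict[str, str]
) -> Dict[str, Optional[str]]:
    """Map each user label to where `command` resolves on that user's
    PATH string, or None if it is not found there."""
    return {
        user: shutil.which(command, path=path)
        for user, path in paths_by_user.items()
    }


def missing_for(resolved: Dict[str, Optional[str]]) -> List[str]:
    """Return the user labels for which the command did not resolve."""
    return sorted(user for user, location in resolved.items() if location is None)
```

Running a check like this for the Jenkins agent's environment as well as
for the SSH user before imaging the VM is one way the process could catch
a command that only works in the interactive session.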

On Tue, Oct 19, 2021 at 12:09 PM Robert Burke <ro...@frantil.com> wrote:

> +1 to only having one way to do things. The Lite option seems liable to
> cause more problems since it means its changes can be blown away if a new
> image isn't prepared anyway.
> I don't think we are changing the images often enough for it.  Perhaps
> call it the option to test changes if anything?
>
> On Tue, Oct 19, 2021, 11:55 AM Valentyn Tymofieiev <va...@google.com>
> wrote:
>
>> All workers were updated to use jenkins-slave-boot-image-20211011, which
>> should have had a go command, but it appears slightly misconfigured. I
>> reopened BEAM-13037 [1] and added some details there.
>>
>> I also added instructions to the wiki [2] on how to perform an image swap and
>> it is actually very straightforward. I think a lesson here is that making
>> 'lite' upgrades is brittle as misconfigurations could resurface down the
>> road when the context of the lite upgrade is no longer fresh in our memory.
>>
>> I suggest we revise the instructions to keep only image swap commands and
>> remove the 'lite' update option. +Daniel Oliveira
>> <da...@google.com>, WDYT?  In the meantime, we should also prepare
>> an image that fixes the misconfiguration. Would you be able to help with
>> that? Thank you.
>>
>> [1] https://issues.apache.org/jira/browse/BEAM-13037
>> [2]
>> https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers
>>
>>
>> On Tue, Oct 19, 2021 at 8:46 AM Robert Burke <ro...@frantil.com> wrote:
>>
>>> FYI it looks like all the Go tests are now failing because it can't find
>>> the Go command at all.
>>> Did a Jenkins image without Go (v1.16+) pre-installed get pushed?
>>>
>>> On Mon, Oct 18, 2021, 1:45 PM Valentyn Tymofieiev <va...@google.com>
>>> wrote:
>>>
>>>> Thanks Daniel,
>>>>
>>>> I can recreate the VMs on new disks.
>>>>
>>>> We currently have a set of stopped jenkins workers (named:
>>>> apache-beam-jenkins-##) and running workers (named:
>>>> apache-ci-beam-jenkins-##)
>>>>
>>>> Are there any concerns about deleting the stopped group of workers?
>>>>
>>>>
>>>>
>>>> On Mon, Oct 18, 2021 at 11:19 AM Ahmet Altay <al...@google.com> wrote:
>>>>
>>>>> Thank you Daniel, Valentyn!
>>>>>
>>>>> On Mon, Oct 18, 2021 at 8:02 AM Daniel Oliveira <
>>>>> danoliveira@google.com> wrote:
>>>>>
>>>>>> I performed a light update of both Go and Python (from Valentyn's
>>>>>> update) on each worker VM over the weekend. I also added additional
>>>>>> instructions for the light update to Confluence (as an alternative to the
>>>>>> current instructions).
>>>>>>
>>>>>> There is still reason to perform a full update at some point:
>>>>>> Valentyn updated the VM image from 500 GB to 1000 GB of storage, which
>>>>>> requires a full update to actually take effect.
>>>>>>
>>>>>> On Tue, Oct 12, 2021 at 10:32 AM Valentyn Tymofieiev <
>>>>>> valentyn@google.com> wrote:
>>>>>>
>>>>>>> > 3. SSH into the agent and perform the update.
>>>>>>> So, this would be a 'lite' version of the update, where we make
>>>>>>> changes to the live worker without recreating the worker VM with a new image?
>>>>>>> We could perhaps document both options, and also make it clear that
>>>>>>> producing a VM image that has the necessary updates is mandatory even if we
>>>>>>> perform 'lite' updates without recreating the worker.
>>>>>>> Also, for a lite update, marking the Jenkins worker offline may be
>>>>>>> optional, as some updates might not be disruptive (such as installing some
>>>>>>> software that will not be used immediately).
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Oct 11, 2021 at 7:53 PM Robert Burke <ro...@frantil.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> SGTM. Thank you very much Daniel!
>>>>>>>>
>>>>>>>> On Mon, Oct 11, 2021, 7:51 PM Ahmet Altay <al...@google.com> wrote:
>>>>>>>>
>>>>>>>>> Thank you Daniel. Could you please update the wiki once you are
>>>>>>>>> done with the process?
>>>>>>>>>
>>>>>>>>> On Mon, Oct 11, 2021 at 6:22 PM Daniel Oliveira <
>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> Took me a bit to get to this, sorry. I finally figured out an
>>>>>>>>>> approach for updating Go and did so and will be updating the image
>>>>>>>>>> momentarily.
>>>>>>>>>>
>>>>>>>>>> I think a more important note is that I tried what Valentyn was
>>>>>>>>>> considering, which is SSHing into workers and updating the dependency. I'll
>>>>>>>>>> describe the process below, but the summary is that I did it on one worker
>>>>>>>>>> with Go so far, saw no problems over the weekend, and would like to
>>>>>>>>>> continue updating the rest of the workers if there are no objections.
>>>>>>>>>>
>>>>>>>>>> Here's a step-by-step of what I did. If we decide to stick with
>>>>>>>>>> this approach, these instructions can be added to Confluence:
>>>>>>>>>>
>>>>>>>>>> 1. Go to the page for the Jenkins agent you want to update [1]
>>>>>>>>>> and click "Mark this node temporarily offline", leaving a reason such as
>>>>>>>>>> "Updating X dependency."
>>>>>>>>>> 2. Wait until there are no more tests running in that agent
>>>>>>>>>> (under "Build Executor Status" on the left of the page).
>>>>>>>>>> 3. SSH into the agent and perform the update.
>>>>>>>>>> 4. Mark the node as online again.
>>>>>>>>>> 5. Repeat for every worker.
>>>>>>>>>>
>>>>>>>>>> And these are some additional steps if you want to immediately
>>>>>>>>>> run a test suite to check that the update worked correctly. For example in
>>>>>>>>>> my case, I wanted to check against the Go Postcommit, and it was a good
>>>>>>>>>> thing I did, because it actually failed the first time and I had to go back
>>>>>>>>>> in to fix a small oversight I made. So doing this after you update your
>>>>>>>>>> first worker is probably a good idea before updating the rest:
>>>>>>>>>>
>>>>>>>>>> 1. Go to the page for the job you want to run (for example: [2]).
>>>>>>>>>> 2. Click "Configure" on the left menu.
>>>>>>>>>> 3. Find the checkmark "Restrict where this project can be run"
>>>>>>>>>> and change the restriction from "beam" to the specific name of the agent
>>>>>>>>>> (ex. "apache-beam-jenkins-1").
>>>>>>>>>> 4. Save and apply that change.
>>>>>>>>>> 5. Back on the page for the job, click "Build with Parameters" on
>>>>>>>>>> the left menu.
>>>>>>>>>> 6. Run the build on "master".
>>>>>>>>>> 7. Once you're done checking the results, change the restriction
>>>>>>>>>> for the job back to "beam". (This also gets reset once every 24 hours in
>>>>>>>>>> case you forget.)
>>>>>>>>>>
>>>>>>>>>> I did that on one agent (apache-beam-jenkins-2) on Friday evening
>>>>>>>>>> when it wasn't too busy, and got Go updated and working. I checked that
>>>>>>>>>> agent's execution history again today just in case, and it was healthy over
>>>>>>>>>> the weekend, with no Go-related problems as far as I could see. If there are
>>>>>>>>>> no objections I'd like to go ahead and continue updating the rest of the
>>>>>>>>>> workers (I'll do this late at night or over the weekend to avoid disrupting
>>>>>>>>>> dev work).
>>>>>>>>>>
>>>>>>>>>> [1] https://ci-beam.apache.org/computer/apache-beam-jenkins-1/
>>>>>>>>>> [2] https://ci-beam.apache.org/job/beam_PostCommit_Go/
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 4, 2021 at 6:14 PM Valentyn Tymofieiev <
>>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I updated the image in [1], but did not change the workers to
>>>>>>>>>>> pick up the new image yet. We can do this once we add Go changes on top
>>>>>>>>>>> of it.
>>>>>>>>>>>
>>>>>>>>>>> I am also considering SSHing into every worker and running a
>>>>>>>>>>> one-line command that adds the dependency that was missing. It seems to be
>>>>>>>>>>> low risk, and there is a fall-back plan to re-start the worker using the
>>>>>>>>>>> saved image - both new and old images are saved and available in Cloud
>>>>>>>>>>> Console.
>>>>>>>>>>>
>>>>>>>>>>> Ideally, we should find a way to do a rolling upgrade that a PMC
>>>>>>>>>>> or committer could trigger without logging into every machine.
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424228#comment-17424228
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 22, 2021 at 3:28 PM Daniel Oliveira <
>>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> @Brian Hulette <bh...@google.com> That button seems like
>>>>>>>>>>>> exactly what we'd need. Doing it manually would be a pain, but it's
>>>>>>>>>>>> probably still preferable to causing a bunch of aborted tests.
>>>>>>>>>>>>
>>>>>>>>>>>> @Valentyn Tymofieiev <va...@google.com> Collaborating to do
>>>>>>>>>>>> both updates at once is a great idea! I'll message you directly about it.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:44 PM Valentyn Tymofieiev <
>>>>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I am also interested in updating the version of Python on the
>>>>>>>>>>>>> VMs; I need to install Python 3.9. Thanks for looking into this. We can
>>>>>>>>>>>>> coordinate to make one update instead of two.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:40 PM Brian Hulette <
>>>>>>>>>>>>> bhulette@google.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm not sure about best practices here. Out of curiosity I
>>>>>>>>>>>>>> just poked around in the Jenkins UI (e.g. [1]) and it looks like you can
>>>>>>>>>>>>>> manually "Mark node temporarily offline" when logged in (if you're a
>>>>>>>>>>>>>> committer). According to [2] this will prevent it from picking up new jobs
>>>>>>>>>>>>>> after it's finished the currently executing ones. Doing that manually for
>>>>>>>>>>>>>> every worker could be a pain though.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>> https://ci-beam.apache.org/computer/apache-beam-jenkins-13/
>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>> https://stackoverflow.com/questions/26553612/how-do-i-disable-a-node-in-jenkins-ui-after-it-has-completed-its-currently-runni
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 1:03 PM Daniel Oliveira <
>>>>>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm aiming at upgrading the version of Go on our Jenkins
>>>>>>>>>>>>>>> VMs, and I found these instructions on upgrading software
>>>>>>>>>>>>>>> on Jenkins
>>>>>>>>>>>>>>> <https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers> on
>>>>>>>>>>>>>>> our cwiki.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I haven't started going through it yet, but I was wondering
>>>>>>>>>>>>>>> about the last few steps that involve stopping VMs, deleting boot disks,
>>>>>>>>>>>>>>> and restarting executors. Is there some best practice for that section to
>>>>>>>>>>>>>>> avoid causing interruptions in our automated testing? Should I be trying to
>>>>>>>>>>>>>>> do this outside of peak dev hours, or going one VM at a time so others can
>>>>>>>>>>>>>>> pick up extra load, or anything like that?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Daniel Oliveira
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>

Re: Best practices for upgrading installed dependencies on Jenkins VMs?

Posted by Robert Burke <ro...@frantil.com>.
+1 to only having one way to do things. The Lite option seems liable to
cause more problems, since its changes can be blown away if a new image
isn't prepared anyway.
I don't think we are changing the images often enough to need it. Perhaps
call it the option for testing changes, if anything?

On Tue, Oct 19, 2021, 11:55 AM Valentyn Tymofieiev <va...@google.com>
wrote:

> All workers were updated to use jenkins-slave-boot-image-20211011, which
> should have had a go command, but it appears slightly misconfigured. I
> reopened BEAM-13037 [1] and added some details there.
>
> I also added instructions to the wiki [2] on how to perform an image swap and
> it is actually very straightforward. I think a lesson here is that making
> 'lite' upgrades is brittle as misconfigurations could resurface down the
> road when the context of the lite upgrade is no longer fresh in our memory.
>
> I suggest we revise the instructions to keep only image swap commands and
> remove the 'lite' update option. +Daniel Oliveira <da...@google.com>,
> WDYT?  In the meantime, we should also prepare an image that fixes the
> misconfiguration. Would you be able to help with that? Thank you.
>
> [1] https://issues.apache.org/jira/browse/BEAM-13037
> [2]
> https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers
>
>
> On Tue, Oct 19, 2021 at 8:46 AM Robert Burke <ro...@frantil.com> wrote:
>
>> FYI it looks like all the Go tests are now failing because they can't find
>> the go command at all.
>> Did a Jenkins image without Go (v1.16+) pre-installed get pushed?
>>
>> On Mon, Oct 18, 2021, 1:45 PM Valentyn Tymofieiev <va...@google.com>
>> wrote:
>>
>>> Thanks Daniel,
>>>
>>> I can recreate the VMs on new disks.
>>>
>>> We currently have a set of stopped Jenkins workers (named
>>> apache-beam-jenkins-##) and a set of running workers (named
>>> apache-ci-beam-jenkins-##).
>>>
>>> Are there any concerns about deleting the stopped group of workers?
>>>
>>>
>>>
>>> On Mon, Oct 18, 2021 at 11:19 AM Ahmet Altay <al...@google.com> wrote:
>>>
>>>> Thank you Daniel, Valentyn!
>>>>
>>>> On Mon, Oct 18, 2021 at 8:02 AM Daniel Oliveira <da...@google.com>
>>>> wrote:
>>>>
>>>>> I performed a light update of both Go and Python (from Valentyn's
>>>>> update) on each worker VM over the weekend. I also added additional
>>>>> instructions for the light update to Confluence (as an alternative to the
>>>>> current instructions).
>>>>>
>>>>> There is still reason to perform a full update at some point: Valentyn
>>>>> updated the VM image from 500 GB to 1000 GB of storage, which requires a
>>>>> full update to actually take effect.
>>>>>
>>>>> On Tue, Oct 12, 2021 at 10:32 AM Valentyn Tymofieiev <
>>>>> valentyn@google.com> wrote:
>>>>>
>>>>>> > 3. SSH into the agent and perform the update.
>>>>>> So, this would be a 'lite' version of the update, where we make
>>>>>> changes to the live worker without recreating the worker VM with a new image?
>>>>>> We could perhaps document both options, and also make it clear that
>>>>>> producing a VM image that has the necessary updates is mandatory even if we
>>>>>> perform 'lite' updates without recreating the worker.
>>>>>> Also, for a lite update, marking the Jenkins worker offline may be
>>>>>> optional, as some updates might not be disruptive (such as installing some
>>>>>> software that will not be used immediately).
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Oct 11, 2021 at 7:53 PM Robert Burke <ro...@frantil.com>
>>>>>> wrote:
>>>>>>
>>>>>>> SGTM. Thank you very much Daniel!
>>>>>>>
>>>>>>> On Mon, Oct 11, 2021, 7:51 PM Ahmet Altay <al...@google.com> wrote:
>>>>>>>
>>>>>>>> Thank you Daniel. Could you please update the wiki once you are
>>>>>>>> done with the process?
>>>>>>>>
>>>>>>>> On Mon, Oct 11, 2021 at 6:22 PM Daniel Oliveira <
>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>
>>>>>>>>> Took me a bit to get to this, sorry. I finally figured out an
>>>>>>>>> approach for updating Go and did so and will be updating the image
>>>>>>>>> momentarily.
>>>>>>>>>
>>>>>>>>> I think a more important note is that I tried what Valentyn was
>>>>>>>>> considering, which is SSHing into workers and updating the dependency. I'll
>>>>>>>>> describe the process below, but the summary is that I did it on one worker
>>>>>>>>> with Go so far, saw no problems over the weekend, and would like to
>>>>>>>>> continue updating the rest of the workers if there are no objections.
>>>>>>>>>
>>>>>>>>> Here's a step-by-step of what I did. If we decide to stick with
>>>>>>>>> this approach, these instructions can be added to Confluence:
>>>>>>>>>
>>>>>>>>> 1. Go to the page for the Jenkins agent you want to update [1] and
>>>>>>>>> click "Mark this node temporarily offline", leaving a reason such as
>>>>>>>>> "Updating X dependency."
>>>>>>>>> 2. Wait until there are no more tests running in that agent (under
>>>>>>>>> "Build Executor Status" on the left of the page).
>>>>>>>>> 3. SSH into the agent and perform the update.
>>>>>>>>> 4. Mark the node as online again.
>>>>>>>>> 5. Repeat for every worker.
>>>>>>>>>
>>>>>>>>> And these are some additional steps if you want to immediately run
>>>>>>>>> a test suite to check that the update worked correctly. For example in my
>>>>>>>>> case, I wanted to check against the Go Postcommit, and it was a good thing
>>>>>>>>> I did, because it actually failed the first time and I had to go back in to
>>>>>>>>> fix a small oversight I made. So doing this after you update your first
>>>>>>>>> worker is probably a good idea before updating the rest:
>>>>>>>>>
>>>>>>>>> 1. Go to the page for the job you want to run (for example: [2]).
>>>>>>>>> 2. Click "Configure" on the left menu.
>>>>>>>>> 3. Find the checkbox "Restrict where this project can be run" and
>>>>>>>>> change the restriction from "beam" to the specific name of the agent (ex.
>>>>>>>>> "apache-beam-jenkins-1").
>>>>>>>>> 4. Save and apply that change.
>>>>>>>>> 5. Back on the page for the job, click "Build with Parameters" on
>>>>>>>>> the left menu.
>>>>>>>>> 6. Run the build on "master".
>>>>>>>>> 7. Once you're done checking the results, change the restriction
>>>>>>>>> for the job back to "beam". (This also gets reset once every 24 hours in
>>>>>>>>> case you forget.)
>>>>>>>>>
>>>>>>>>> I did that on one agent (apache-beam-jenkins-2) on Friday evening
>>>>>>>>> when it wasn't too busy, and got Go updated and working. I checked that
>>>>>>>>> agent's execution history again today just in case, and it was healthy over
>>>>>>>>> the weekend, with no Go-related problems as far as I could see. If there are
>>>>>>>>> no objections, I'd like to go ahead and continue updating the rest of the
>>>>>>>>> workers (I'll do this late at night or over the weekend to avoid disrupting
>>>>>>>>> dev work).
>>>>>>>>>
>>>>>>>>> [1] https://ci-beam.apache.org/computer/apache-beam-jenkins-1/
>>>>>>>>> [2] https://ci-beam.apache.org/job/beam_PostCommit_Go/
>>>>>>>>>
>>>>>>>>> On Mon, Oct 4, 2021 at 6:14 PM Valentyn Tymofieiev <
>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> I updated the image in [1], but have not yet changed the workers to
>>>>>>>>>> pick up the new image. We can do this once we add Go changes on top of
>>>>>>>>>> it.
>>>>>>>>>>
>>>>>>>>>> I am also considering SSHing into every worker and running a one-line
>>>>>>>>>> command that adds the dependency that was missing. It seems to be low risk,
>>>>>>>>>> and there is a fall-back plan to restart the worker using the saved image
>>>>>>>>>> - both new and old images are saved and available in Cloud Console.
>>>>>>>>>>
>>>>>>>>>> Ideally, we should find a way to do a rolling upgrade that a PMC
>>>>>>>>>> or committer could trigger without logging into every machine.
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424228#comment-17424228
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 22, 2021 at 3:28 PM Daniel Oliveira <
>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> @Brian Hulette <bh...@google.com> That button seems like
>>>>>>>>>>> exactly what we'd need. Doing it manually would be a pain, but it's
>>>>>>>>>>> probably still preferable to causing a bunch of aborted tests.
>>>>>>>>>>>
>>>>>>>>>>> @Valentyn Tymofieiev <va...@google.com> Collaborating to do
>>>>>>>>>>> both updates at once is a great idea! I'll message you directly about it.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:44 PM Valentyn Tymofieiev <
>>>>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I am also interested in updating the version of Python on the VMs;
>>>>>>>>>>>> I need to install Python 3.9. Thanks for looking into this. We can
>>>>>>>>>>>> coordinate to make one update instead of two.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Sep 22, 2021 at 2:40 PM Brian Hulette <
>>>>>>>>>>>> bhulette@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I'm not sure about best practices here. Out of curiosity I
>>>>>>>>>>>>> just poked around in the Jenkins UI (e.g. [1]) and it looks like you can
>>>>>>>>>>>>> manually "Mark node temporarily offline" when logged in (if you're a
>>>>>>>>>>>>> committer). According to [2] this will prevent it from picking up new jobs
>>>>>>>>>>>>> after it's finished the currently executing ones. Doing that manually for
>>>>>>>>>>>>> every worker could be a pain though.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> https://ci-beam.apache.org/computer/apache-beam-jenkins-13/
>>>>>>>>>>>>> [2]
>>>>>>>>>>>>> https://stackoverflow.com/questions/26553612/how-do-i-disable-a-node-in-jenkins-ui-after-it-has-completed-its-currently-runni
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Sep 22, 2021 at 1:03 PM Daniel Oliveira <
>>>>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm aiming at upgrading the version of Go on our Jenkins VMs,
>>>>>>>>>>>>>> and I found these instructions on upgrading software on
>>>>>>>>>>>>>> Jenkins
>>>>>>>>>>>>>> <https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers> on
>>>>>>>>>>>>>> our cwiki.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I haven't started going through it yet, but I was wondering
>>>>>>>>>>>>>> about the last few steps that involve stopping VMs, deleting boot disks,
>>>>>>>>>>>>>> and restarting executors. Is there some best practice for that section to
>>>>>>>>>>>>>> avoid causing interruptions in our automated testing? Should I be trying to
>>>>>>>>>>>>>> do this outside of peak dev hours, or going one VM at a time so others can
>>>>>>>>>>>>>> pick up extra load, or anything like that?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Daniel Oliveira
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
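
The drain-then-update procedure quoted above (mark a node offline, wait for its running builds to finish, update it, bring it back online, optionally run a test job pinned to it, then move to the next worker) could be sketched roughly as the loop below. This is only an illustration of the ordering: every callable passed in (mark_offline, is_idle, update, mark_online, verify, wait) is a hypothetical hook the operator would have to supply, not an existing Jenkins or Beam API.

```python
import time


def rolling_update(workers, mark_offline, is_idle, update, mark_online,
                   verify=None, wait=lambda: time.sleep(30)):
    """Update workers one at a time, stopping at the first failed check.

    All callables are hypothetical hooks supplied by the caller.
    Returns (updated_workers, failed_worker_or_None).
    """
    updated = []
    for worker in workers:
        # Step 1: stop the node from picking up new jobs.
        mark_offline(worker, "Updating X dependency.")
        # Step 2: wait until the builds already running have finished.
        while not is_idle(worker):
            wait()
        # Step 3: perform the update (for example over SSH).
        update(worker)
        # Step 4: bring the node back online.
        mark_online(worker)
        # Optional: run a test job restricted to this node before moving on.
        if verify is not None and not verify(worker):
            return updated, worker
        updated.append(worker)
    return updated, None
```

Stopping at the first failed verification mirrors the advice in the quoted message to check the first updated worker before touching the rest.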

Re: Best practices for upgrading installed dependencies on Jenkins VMs?

Posted by Valentyn Tymofieiev <va...@google.com>.
All workers were updated to use jenkins-slave-boot-image-20211011, which
should have had a go command, but it appears slightly misconfigured. I
reopened BEAM-13037 [1] and added some details there.

I also added instructions to the wiki [2] on how to perform an image swap, and
it is actually very straightforward. I think a lesson here is that making
'lite' upgrades is brittle, as misconfigurations could resurface down the
road when the context of the lite upgrade is no longer fresh in our memory.

I suggest we revise the instructions to keep only image swap commands and
remove the 'lite' update option. +Daniel Oliveira <da...@google.com>,
WDYT?  In the meantime, we should also prepare an image that fixes the
misconfiguration. Would you be able to help with that? Thank you.

[1] https://issues.apache.org/jira/browse/BEAM-13037
[2]
https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers



Re: Best practices for upgrading installed dependencies on Jenkins VMs?

Posted by Robert Burke <ro...@frantil.com>.
FYI it looks like all the Go tests are now failing because they can't find
the go command at all.
Did a Jenkins image without Go (v1.16+) pre-installed get pushed?
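
A quick check like the following, run on a worker, might catch this kind of failure before the test suites do. It is a minimal sketch: the function names are made up for illustration, and the 1.16 minimum is taken only from the "v1.16+" mentioned above. It relies on `go version` printing a line such as "go version go1.16.5 linux/amd64".

```python
import re
import shutil
import subprocess

MIN_GO = (1, 16)  # assumption: minimum taken from the "v1.16+" above


def parse_go_version(output):
    """Extract (major, minor, patch) from `go version` output, or None."""
    match = re.search(r"\bgo(\d+)\.(\d+)(?:\.(\d+))?", output)
    if match is None:
        return None
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def go_is_usable(min_version=MIN_GO):
    """Return True if `go` is on PATH and reports at least min_version."""
    if shutil.which("go") is None:
        return False  # the failure mode reported here: command not found
    result = subprocess.run(
        ["go", "version"], capture_output=True, text=True, check=False
    )
    version = parse_go_version(result.stdout)
    return version is not None and version[:2] >= min_version
```

Note that `shutil.which` only consults the PATH of the process running the check, so it should be run as the same user and with the same environment as the Jenkins agent.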


Re: Best practices for upgrading installed dependencies on Jenkins VMs?

Posted by Valentyn Tymofieiev <va...@google.com>.
Thanks Daniel,

I can recreate the VMs on new disks.

We currently have a set of stopped Jenkins workers (named
apache-beam-jenkins-##) and a set of running workers (named
apache-ci-beam-jenkins-##).

Are there any concerns about deleting the stopped group of workers?
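
Before deleting anything, it may help to split the instance names into the two
groups and review the result. The following is only a minimal sketch: the two
name prefixes come from this message, while the function name, the assumption
that "##" means digits, and the sample names are hypothetical. A name matching
a prefix says nothing about whether that VM is actually stopped, so the VM
state should still be checked in Cloud Console.

```python
import re

# Prefixes taken from this thread; the "##" suffix is assumed to be digits.
OLD_PATTERN = re.compile(r"^apache-beam-jenkins-\d+$")
NEW_PATTERN = re.compile(r"^apache-ci-beam-jenkins-\d+$")


def group_workers(names):
    """Split instance names by naming pattern; unmatched names go to 'other'."""
    groups = {"apache-beam-jenkins": [], "apache-ci-beam-jenkins": [], "other": []}
    for name in names:
        if OLD_PATTERN.match(name):
            groups["apache-beam-jenkins"].append(name)
        elif NEW_PATTERN.match(name):
            groups["apache-ci-beam-jenkins"].append(name)
        else:
            groups["other"].append(name)
    return groups


if __name__ == "__main__":
    # Hypothetical instance names, for illustration only.
    sample = ["apache-beam-jenkins-1", "apache-ci-beam-jenkins-1", "some-other-vm"]
    for group, members in group_workers(sample).items():
        print(group, members)
```

Anything that lands in "other" is worth a second look before deletion.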



On Mon, Oct 18, 2021 at 11:19 AM Ahmet Altay <al...@google.com> wrote:

> Thank you Daniel, Valentyn!
>
> On Mon, Oct 18, 2021 at 8:02 AM Daniel Oliveira <da...@google.com>
> wrote:
>
>> I performed a light update of both Go and Python (from Valentyn's update)
>> on each worker VM over the weekend. I also added additional instructions
>> for the light update to Confluence (as an alternative to the current
>> instructions).
>>
>> There is still reason to perform a full update at some point: Valentyn
>> updated the VM image from 500 GB to 1000 GB of storage, which requires a
>> full update to actually take effect.
>>
>> On Tue, Oct 12, 2021 at 10:32 AM Valentyn Tymofieiev <va...@google.com>
>> wrote:
>>
>>> > 3. SSH into the agent and perform the update.
>>> So, this would be a 'lite' version of the update, where we make changes
>>> to the live worker without recreating the worker VM from a new image? We could
>>> perhaps document both options, and also make it clear that producing a VM
>>> image that has the necessary updates is mandatory even if we perform 'lite'
>>> updates without recreating the worker.
>>> Also, for a lite update, marking the Jenkins worker offline may be
>>> optional, as some updates might not be disruptive (such as installing some
>>> software that will not be used immediately).
>>>
>>>
>>>
>>> On Mon, Oct 11, 2021 at 7:53 PM Robert Burke <ro...@frantil.com> wrote:
>>>
>>>> SGTM. Thank you very much Daniel!
>>>>
>>>> On Mon, Oct 11, 2021, 7:51 PM Ahmet Altay <al...@google.com> wrote:
>>>>
>>>>> Thank you Daniel. Could you please update the wiki once you are done
>>>>> with the process?
>>>>>
>>>>> On Mon, Oct 11, 2021 at 6:22 PM Daniel Oliveira <
>>>>> danoliveira@google.com> wrote:
>>>>>
>>>>>> Took me a bit to get to this, sorry. I finally figured out an
>>>>>> approach for updating Go and did so and will be updating the image
>>>>>> momentarily.
>>>>>>
>>>>>> I think a more important note is that I tried what Valentyn was
>>>>>> considering, which is SSHing into workers and updating the dependency. I'll
>>>>>> describe the process below, but the summary is that I did it on one worker
>>>>>> with Go so far, saw no problems over the weekend, and would like to
>>>>>> continue updating the rest of the workers if there are no objections.
>>>>>>
>>>>>> Here's a step-by-step of what I did. If we decide to stick with this
>>>>>> approach, these instructions can be added to Confluence:
>>>>>>
>>>>>> 1. Go to the page for the Jenkins agent you want to update [1] and
>>>>>> click "Mark this node temporarily offline", leaving a reason such as
>>>>>> "Updating X dependency."
>>>>>> 2. Wait until there are no more tests running in that agent (under
>>>>>> "Build Executor Status" on the left of the page).
>>>>>> 3. SSH into the agent and perform the update.
>>>>>> 4. Mark the node as online again.
>>>>>> 5. Repeat for every worker.
>>>>>>
>>>>>> And these are some additional steps if you want to immediately run a
>>>>>> test suite to check that the update worked correctly. For example in my
>>>>>> case, I wanted to check against the Go Postcommit, and it was a good thing
>>>>>> I did, because it actually failed the first time and I had to go back in to
>>>>>> fix a small oversight I made. So doing this after you update your first
>>>>>> worker is probably a good idea before updating the rest:
>>>>>>
>>>>>> 1. Go to the page for the job you want to run (for example: [2]).
>>>>>> 2. Click "Configure" on the left menu.
>>>>>> 3. Find the checkmark "Restrict where this project can be run" and
>>>>>> change the restriction from "beam" to the specific name of the agent (ex.
>>>>>> "apache-beam-jenkins-1").
>>>>>> 4. Save and apply that change.
>>>>>> 5. Back on the page for the job, click "Build with Parameters" on the
>>>>>> left menu.
>>>>>> 6. Run the build on "master".
>>>>>> 7. Once you're done checking the results, change the restriction for
>>>>>> the job back to "beam". (This also gets reset once every 24 hours in case
>>>>>> you forget.)
>>>>>>
>>>>>> I did that on one agent (apache-beam-jenkins-2) on Friday evening
>>>>>> when it wasn't too busy, and got Go updated and working. I checked that
>>>>>> agent's execution history again today just in case, and it was healthy over
>>>>>> the weekend, with no Go-related problems as far as I could see. If there are
>>>>>> no objections, I'd like to go ahead and continue updating the rest of the
>>>>>> workers (I'll do this late at night or over the weekend to avoid disrupting
>>>>>> dev work).
>>>>>>
>>>>>> [1] https://ci-beam.apache.org/computer/apache-beam-jenkins-1/
>>>>>> [2] https://ci-beam.apache.org/job/beam_PostCommit_Go/
>>>>>>
>>>>>> On Mon, Oct 4, 2021 at 6:14 PM Valentyn Tymofieiev <
>>>>>> valentyn@google.com> wrote:
>>>>>>
>>>>>>> I updated the image in [1], but have not yet changed the workers to
>>>>>>> pick up the new image. We can do this once we add Go changes on top of
>>>>>>> it.
>>>>>>>
>>>>>>> I am also considering SSHing into every worker and running a one-line
>>>>>>> command that adds the dependency that was missing. It seems to be low risk,
>>>>>>> and there is a fallback plan to restart the worker using the saved image:
>>>>>>> both new and old images are saved and available in Cloud Console.
>>>>>>>
>>>>>>> Ideally, we should find a way to do a rolling upgrade that a PMC or
>>>>>>> committer could trigger without logging into every machine.
>>>>>>>
>>>>>>> [1]
>>>>>>> https://issues.apache.org/jira/browse/BEAM-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424228#comment-17424228
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Sep 22, 2021 at 3:28 PM Daniel Oliveira <
>>>>>>> danoliveira@google.com> wrote:
>>>>>>>
>>>>>>>> @Brian Hulette <bh...@google.com> That button seems like
>>>>>>>> exactly what we'd need. Doing it manually would be a pain, but it's
>>>>>>>> probably still preferable to causing a bunch of aborted tests.
>>>>>>>>
>>>>>>>> @Valentyn Tymofieiev <va...@google.com> Collaborating to do
>>>>>>>> both updates at once is a great idea! I'll message you directly about it.
>>>>>>>>
>>>>>>>> On Wed, Sep 22, 2021 at 2:44 PM Valentyn Tymofieiev <
>>>>>>>> valentyn@google.com> wrote:
>>>>>>>>
>>>>>>>>> I am also interested in updating the version of Python on the VMs; I
>>>>>>>>> need to install Python 3.9. Thanks for looking into this. We can
>>>>>>>>> coordinate to make one update instead of two.
>>>>>>>>>
>>>>>>>>> On Wed, Sep 22, 2021 at 2:40 PM Brian Hulette <bh...@google.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I'm not sure about best practices here. Out of curiosity I just
>>>>>>>>>> poked around in the Jenkins UI (e.g. [1]) and it looks like you can
>>>>>>>>>> manually "Mark node temporarily offline" when logged in (if you're a
>>>>>>>>>> committer). According to [2] this will prevent it from picking up new jobs
>>>>>>>>>> after it's finished the currently executing ones. Doing that manually for
>>>>>>>>>> every worker could be a pain though.
>>>>>>>>>>
>>>>>>>>>> Brian
>>>>>>>>>>
>>>>>>>>>> [1] https://ci-beam.apache.org/computer/apache-beam-jenkins-13/
>>>>>>>>>> [2]
>>>>>>>>>> https://stackoverflow.com/questions/26553612/how-do-i-disable-a-node-in-jenkins-ui-after-it-has-completed-its-currently-runni
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 22, 2021 at 1:03 PM Daniel Oliveira <
>>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>
>>>>>>>>>>> I'm aiming at upgrading the version of Go on our Jenkins VMs,
>>>>>>>>>>> and I found these instructions on upgrading software on Jenkins
>>>>>>>>>>> <https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers> on
>>>>>>>>>>> our cwiki.
>>>>>>>>>>>
>>>>>>>>>>> I haven't started going through it yet, but I was wondering
>>>>>>>>>>> about the last few steps that involve stopping VMs, deleting boot disks,
>>>>>>>>>>> and restarting executors. Is there some best practice for that section to
>>>>>>>>>>> avoid causing interruptions in our automated testing? Should I be trying to
>>>>>>>>>>> do this outside of peak dev hours, or going one VM at a time so others can
>>>>>>>>>>> pick up extra load, or anything like that?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Daniel Oliveira
>>>>>>>>>>>
>>>>>>>>>>

Re: Best practices for upgrading installed dependencies on Jenkins VMs?

Posted by Ahmet Altay <al...@google.com>.
Thank you Daniel, Valentyn!

On Mon, Oct 18, 2021 at 8:02 AM Daniel Oliveira <da...@google.com>
wrote:

> I performed a light update of both Go and Python (from Valentyn's update)
> on each worker VM over the weekend. I also added additional instructions
> for the light update to Confluence (as an alternative to the current
> instructions).
>
> There is still reason to perform a full update at some point: Valentyn
> updated the VM image from 500 GB to 1000 GB of storage, which requires a
> full update to actually take effect.
>
> On Tue, Oct 12, 2021 at 10:32 AM Valentyn Tymofieiev <va...@google.com>
> wrote:
>
>> > 3. SSH into the agent and perform the update.
>> So, this would be a 'lite' version of the update, where we make changes
>> to the live worker without recreating the worker VM from a new image? We could
>> perhaps document both options, and also make it clear that producing a VM
>> image that has the necessary updates is mandatory even if we perform 'lite'
>> updates without recreating the worker.
>> Also, for a lite update, marking the Jenkins worker offline may be
>> optional, as some updates might not be disruptive (such as installing some
>> software that will not be used immediately).
>>
>>
>>
>> On Mon, Oct 11, 2021 at 7:53 PM Robert Burke <ro...@frantil.com> wrote:
>>
>>> SGTM. Thank you very much Daniel!
>>>
>>> On Mon, Oct 11, 2021, 7:51 PM Ahmet Altay <al...@google.com> wrote:
>>>
>>>> Thank you Daniel. Could you please update the wiki once you are done
>>>> with the process?
>>>>
>>>> On Mon, Oct 11, 2021 at 6:22 PM Daniel Oliveira <da...@google.com>
>>>> wrote:
>>>>
>>>>> Took me a bit to get to this, sorry. I finally figured out an approach
>>>>> for updating Go and did so and will be updating the image momentarily.
>>>>>
>>>>> I think a more important note is that I tried what Valentyn was
>>>>> considering, which is SSHing into workers and updating the dependency. I'll
>>>>> describe the process below, but the summary is that I did it on one worker
>>>>> with Go so far, saw no problems over the weekend, and would like to
>>>>> continue updating the rest of the workers if there are no objections.
>>>>>
>>>>> Here's a step-by-step of what I did. If we decide to stick with this
>>>>> approach, these instructions can be added to Confluence:
>>>>>
>>>>> 1. Go to the page for the Jenkins agent you want to update [1] and
>>>>> click "Mark this node temporarily offline", leaving a reason such as
>>>>> "Updating X dependency."
>>>>> 2. Wait until there are no more tests running in that agent (under
>>>>> "Build Executor Status" on the left of the page).
>>>>> 3. SSH into the agent and perform the update.
>>>>> 4. Mark the node as online again.
>>>>> 5. Repeat for every worker.
>>>>>
>>>>> And these are some additional steps if you want to immediately run a
>>>>> test suite to check that the update worked correctly. For example in my
>>>>> case, I wanted to check against the Go Postcommit, and it was a good thing
>>>>> I did, because it actually failed the first time and I had to go back in to
>>>>> fix a small oversight I made. So doing this after you update your first
>>>>> worker is probably a good idea before updating the rest:
>>>>>
>>>>> 1. Go to the page for the job you want to run (for example: [2]).
>>>>> 2. Click "Configure" on the left menu.
>>>>> 3. Find the checkmark "Restrict where this project can be run" and
>>>>> change the restriction from "beam" to the specific name of the agent (ex.
>>>>> "apache-beam-jenkins-1").
>>>>> 4. Save and apply that change.
>>>>> 5. Back on the page for the job, click "Build with Parameters" on the
>>>>> left menu.
>>>>> 6. Run the build on "master".
>>>>> 7. Once you're done checking the results, change the restriction for
>>>>> the job back to "beam". (This also gets reset once every 24 hours in case
>>>>> you forget.)
>>>>>
>>>>> I did that on one agent (apache-beam-jenkins-2) on Friday evening when
>>>>> it wasn't too busy, and got Go updated and working. I checked that agent's
>>>>> execution history again today just in case, and it was healthy over
>>>>> the weekend, with no Go-related problems as far as I could see. If there are
>>>>> no objections, I'd like to go ahead and continue updating the rest of the
>>>>> workers (I'll do this late at night or over the weekend to avoid disrupting
>>>>> dev work).
>>>>>
>>>>> [1] https://ci-beam.apache.org/computer/apache-beam-jenkins-1/
>>>>> [2] https://ci-beam.apache.org/job/beam_PostCommit_Go/
>>>>>
>>>>> On Mon, Oct 4, 2021 at 6:14 PM Valentyn Tymofieiev <
>>>>> valentyn@google.com> wrote:
>>>>>
>>>>>> I updated the image in [1], but have not yet changed the workers to
>>>>>> pick up the new image. We can do this once we add Go changes on top of
>>>>>> it.
>>>>>>
>>>>>> I am also considering SSHing into every worker and running a one-line
>>>>>> command that adds the dependency that was missing. It seems to be low risk,
>>>>>> and there is a fallback plan to restart the worker using the saved image:
>>>>>> both new and old images are saved and available in Cloud Console.
>>>>>>
>>>>>> Ideally, we should find a way to do a rolling upgrade that a PMC or
>>>>>> committer could trigger without logging into every machine.
>>>>>>
>>>>>> [1]
>>>>>> https://issues.apache.org/jira/browse/BEAM-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424228#comment-17424228
>>>>>>
>>>>>>
>>>>>> On Wed, Sep 22, 2021 at 3:28 PM Daniel Oliveira <
>>>>>> danoliveira@google.com> wrote:
>>>>>>
>>>>>>> @Brian Hulette <bh...@google.com> That button seems like exactly
>>>>>>> what we'd need. Doing it manually would be a pain, but it's probably still
>>>>>>> preferable to causing a bunch of aborted tests.
>>>>>>>
>>>>>>> @Valentyn Tymofieiev <va...@google.com> Collaborating to do both
>>>>>>> updates at once is a great idea! I'll message you directly about it.
>>>>>>>
>>>>>>> On Wed, Sep 22, 2021 at 2:44 PM Valentyn Tymofieiev <
>>>>>>> valentyn@google.com> wrote:
>>>>>>>
>>>>>>>> I am also interested in updating the version of Python on the VMs; I
>>>>>>>> need to install Python 3.9. Thanks for looking into this. We can
>>>>>>>> coordinate to make one update instead of two.
>>>>>>>>
>>>>>>>> On Wed, Sep 22, 2021 at 2:40 PM Brian Hulette <bh...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I'm not sure about best practices here. Out of curiosity I just
>>>>>>>>> poked around in the Jenkins UI (e.g. [1]) and it looks like you can
>>>>>>>>> manually "Mark node temporarily offline" when logged in (if you're a
>>>>>>>>> committer). According to [2] this will prevent it from picking up new jobs
>>>>>>>>> after it's finished the currently executing ones. Doing that manually for
>>>>>>>>> every worker could be a pain though.
>>>>>>>>>
>>>>>>>>> Brian
>>>>>>>>>
>>>>>>>>> [1] https://ci-beam.apache.org/computer/apache-beam-jenkins-13/
>>>>>>>>> [2]
>>>>>>>>> https://stackoverflow.com/questions/26553612/how-do-i-disable-a-node-in-jenkins-ui-after-it-has-completed-its-currently-runni
>>>>>>>>>
>>>>>>>>> On Wed, Sep 22, 2021 at 1:03 PM Daniel Oliveira <
>>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hey everyone,
>>>>>>>>>>
>>>>>>>>>> I'm aiming at upgrading the version of Go on our Jenkins VMs, and
>>>>>>>>>> I found these instructions on upgrading software on Jenkins
>>>>>>>>>> <https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers> on
>>>>>>>>>> our cwiki.
>>>>>>>>>>
>>>>>>>>>> I haven't started going through it yet, but I was wondering about
>>>>>>>>>> the last few steps that involve stopping VMs, deleting boot disks, and
>>>>>>>>>> restarting executors. Is there some best practice for that section to avoid
>>>>>>>>>> causing interruptions in our automated testing? Should I be trying to do
>>>>>>>>>> this outside of peak dev hours, or going one VM at a time so others can
>>>>>>>>>> pick up extra load, or anything like that?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Daniel Oliveira
>>>>>>>>>>
>>>>>>>>>

Re: Best practices for upgrading installed dependencies on Jenkins VMs?

Posted by Daniel Oliveira <da...@google.com>.
I performed a light update of both Go and Python (from Valentyn's update)
on each worker VM over the weekend. I also added additional instructions
for the light update to Confluence (as an alternative to the current
instructions).

There is still reason to perform a full update at some point: Valentyn
updated the VM image from 500 GB to 1000 GB of storage, which requires a
full update to actually take effect.

On Tue, Oct 12, 2021 at 10:32 AM Valentyn Tymofieiev <va...@google.com>
wrote:

> > 3. SSH into the agent and perform the update.
> So, this would be a 'lite' version of the update, where we make changes to
> the live worker without recreating the worker VM from a new image? We could
> perhaps document both options, and also make it clear that producing a VM
> image that has the necessary updates is mandatory even if we perform 'lite'
> updates without recreating the worker.
> Also, for a lite update, marking the Jenkins worker offline may be
> optional, as some updates might not be disruptive (such as installing some
> software that will not be used immediately).
>
>
>
> On Mon, Oct 11, 2021 at 7:53 PM Robert Burke <ro...@frantil.com> wrote:
>
>> SGTM. Thank you very much Daniel!
>>
>> On Mon, Oct 11, 2021, 7:51 PM Ahmet Altay <al...@google.com> wrote:
>>
>>> Thank you Daniel. Could you please update the wiki once you are done
>>> with the process?
>>>
>>> On Mon, Oct 11, 2021 at 6:22 PM Daniel Oliveira <da...@google.com>
>>> wrote:
>>>
>>>> Took me a bit to get to this, sorry. I finally figured out an approach
>>>> for updating Go and did so and will be updating the image momentarily.
>>>>
>>>> I think a more important note is that I tried what Valentyn was
>>>> considering, which is SSHing into workers and updating the dependency. I'll
>>>> describe the process below, but the summary is that I did it on one worker
>>>> with Go so far, saw no problems over the weekend, and would like to
>>>> continue updating the rest of the workers if there are no objections.
>>>>
>>>> Here's a step-by-step of what I did. If we decide to stick with this
>>>> approach, these instructions can be added to Confluence:
>>>>
>>>> 1. Go to the page for the Jenkins agent you want to update [1] and
>>>> click "Mark this node temporarily offline", leaving a reason such as
>>>> "Updating X dependency."
>>>> 2. Wait until there are no more tests running in that agent (under
>>>> "Build Executor Status" on the left of the page).
>>>> 3. SSH into the agent and perform the update.
>>>> 4. Mark the node as online again.
>>>> 5. Repeat for every worker.
>>>>
>>>> And these are some additional steps if you want to immediately run a
>>>> test suite to check that the update worked correctly. For example in my
>>>> case, I wanted to check against the Go Postcommit, and it was a good thing
>>>> I did, because it actually failed the first time and I had to go back in to
>>>> fix a small oversight I made. So doing this after you update your first
>>>> worker is probably a good idea before updating the rest:
>>>>
>>>> 1. Go to the page for the job you want to run (for example: [2]).
>>>> 2. Click "Configure" on the left menu.
>>>> 3. Find the checkmark "Restrict where this project can be run" and
>>>> change the restriction from "beam" to the specific name of the agent (ex.
>>>> "apache-beam-jenkins-1").
>>>> 4. Save and apply that change.
>>>> 5. Back on the page for the job, click "Build with Parameters" on the
>>>> left menu.
>>>> 6. Run the build on "master".
>>>> 7. Once you're done checking the results, change the restriction for
>>>> the job back to "beam". (This also gets reset once every 24 hours in case
>>>> you forget.)
>>>>
>>>> I did that on one agent (apache-beam-jenkins-2) on Friday evening when
>>>> it wasn't too busy, and got Go updated and working. I checked that agent's
>>>> execution history again today just in case, and it was healthy over
>>>> the weekend, with no Go-related problems as far as I could see. If there are
>>>> no objections, I'd like to go ahead and continue updating the rest of the
>>>> workers (I'll do this late at night or over the weekend to avoid disrupting
>>>> dev work).
>>>>
>>>> [1] https://ci-beam.apache.org/computer/apache-beam-jenkins-1/
>>>> [2] https://ci-beam.apache.org/job/beam_PostCommit_Go/
>>>>
>>>> On Mon, Oct 4, 2021 at 6:14 PM Valentyn Tymofieiev <va...@google.com>
>>>> wrote:
>>>>
>>>>> I updated the image in [1], but have not yet changed the workers to pick
>>>>> up the new image. We can do this once we add Go changes on top of it.
>>>>>
>>>>> I am also considering SSHing into every worker and running a one-line
>>>>> command that adds the dependency that was missing. It seems to be low risk,
>>>>> and there is a fallback plan to restart the worker using the saved image:
>>>>> both new and old images are saved and available in Cloud Console.
>>>>>
>>>>> Ideally, we should find a way to do a rolling upgrade that a PMC or
>>>>> committer could trigger without logging into every machine.
>>>>>
>>>>> [1]
>>>>> https://issues.apache.org/jira/browse/BEAM-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424228#comment-17424228
>>>>>
>>>>>
>>>>> On Wed, Sep 22, 2021 at 3:28 PM Daniel Oliveira <
>>>>> danoliveira@google.com> wrote:
>>>>>
>>>>>> @Brian Hulette <bh...@google.com> That button seems like exactly
>>>>>> what we'd need. Doing it manually would be a pain, but it's probably still
>>>>>> preferable to causing a bunch of aborted tests.
>>>>>>
>>>>>> @Valentyn Tymofieiev <va...@google.com> Collaborating to do both
>>>>>> updates at once is a great idea! I'll message you directly about it.
>>>>>>
>>>>>> On Wed, Sep 22, 2021 at 2:44 PM Valentyn Tymofieiev <
>>>>>> valentyn@google.com> wrote:
>>>>>>
>>>>>>> I am also interested in updating the version of Python on the VMs; I
>>>>>>> need to install Python 3.9. Thanks for looking into this. We can
>>>>>>> coordinate to make one update instead of two.
>>>>>>>
>>>>>>> On Wed, Sep 22, 2021 at 2:40 PM Brian Hulette <bh...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I'm not sure about best practices here. Out of curiosity I just
>>>>>>>> poked around in the Jenkins UI (e.g. [1]) and it looks like you can
>>>>>>>> manually "Mark node temporarily offline" when logged in (if you're a
>>>>>>>> committer). According to [2] this will prevent it from picking up new jobs
>>>>>>>> after it's finished the currently executing ones. Doing that manually for
>>>>>>>> every worker could be a pain though.
>>>>>>>>
>>>>>>>> Brian
>>>>>>>>
>>>>>>>> [1] https://ci-beam.apache.org/computer/apache-beam-jenkins-13/
>>>>>>>> [2]
>>>>>>>> https://stackoverflow.com/questions/26553612/how-do-i-disable-a-node-in-jenkins-ui-after-it-has-completed-its-currently-runni
>>>>>>>>
>>>>>>>> On Wed, Sep 22, 2021 at 1:03 PM Daniel Oliveira <
>>>>>>>> danoliveira@google.com> wrote:
>>>>>>>>
>>>>>>>>> Hey everyone,
>>>>>>>>>
>>>>>>>>> I'm aiming at upgrading the version of Go on our Jenkins VMs, and
>>>>>>>>> I found these instructions on upgrading software on Jenkins
>>>>>>>>> <https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers> on
>>>>>>>>> our cwiki.
>>>>>>>>>
>>>>>>>>> I haven't started going through it yet, but I was wondering about
>>>>>>>>> the last few steps that involve stopping VMs, deleting boot disks, and
>>>>>>>>> restarting executors. Is there some best practice for that section to avoid
>>>>>>>>> causing interruptions in our automated testing? Should I be trying to do
>>>>>>>>> this outside of peak dev hours, or going one VM at a time so others can
>>>>>>>>> pick up extra load, or anything like that?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Daniel Oliveira
>>>>>>>>>
>>>>>>>>

Re: Best practices for upgrading installed dependencies on Jenkins VMs?

Posted by Valentyn Tymofieiev <va...@google.com>.
> 3. SSH into the agent and perform the update.
So, this would be a 'lite' version of the update, where we make changes to
the live worker without recreating the worker VM from a new image? We could
perhaps document both options, and also make it clear that producing a VM
image that has the necessary updates is mandatory even if we perform 'lite'
updates without recreating the worker.
Also, for a lite update, marking the Jenkins worker offline may be optional,
as some updates might not be disruptive (such as installing some software
that will not be used immediately).
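
To make the 'lite' procedure concrete, here is a rough sketch of how the
per-worker steps could be laid out one worker at a time, with the offline step
made optional as suggested above. It only builds URLs and prints a plan; it
does not contact Jenkins. The host comes from the links in this thread. The
toggleOffline and api/json paths are standard Jenkins node endpoints, but the
assumption that they are reachable with a committer's credentials on this
instance is untested, and the function names are hypothetical.

```python
from urllib.parse import quote, urlencode

JENKINS_URL = "https://ci-beam.apache.org"  # host taken from the links in this thread


def toggle_offline_url(node, reason):
    """URL that, when POSTed with credentials, toggles the node's offline state."""
    query = urlencode({"offlineMessage": reason})
    return f"{JENKINS_URL}/computer/{quote(node)}/toggleOffline?{query}"


def status_url(node):
    """URL of the node's JSON status; its 'idle' field shows if builds still run."""
    return f"{JENKINS_URL}/computer/{quote(node)}/api/json"


def lite_update_plan(nodes, reason, take_offline=True):
    """Yield (node, action) pairs, one worker at a time so others keep taking jobs."""
    for node in nodes:
        if take_offline:
            yield node, "POST " + toggle_offline_url(node, reason)
            yield node, "poll " + status_url(node) + " until idle"
        yield node, "ssh in and apply the update"
        if take_offline:
            # The endpoint is a toggle, so a second POST brings the node back online.
            yield node, "POST " + toggle_offline_url(node, "")


if __name__ == "__main__":
    workers = ["apache-beam-jenkins-1", "apache-beam-jenkins-2"]
    for node, action in lite_update_plan(workers, "Updating Go"):
        print(node, "->", action)
```

Passing take_offline=False gives the shorter plan for updates that are not
disruptive, such as installing software that no job uses yet.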



On Mon, Oct 11, 2021 at 7:53 PM Robert Burke <ro...@frantil.com> wrote:

> SGTM. Thank you very much Daniel!
>
> On Mon, Oct 11, 2021, 7:51 PM Ahmet Altay <al...@google.com> wrote:
>
>> Thank you Daniel. Could you please update the wiki once you are done with
>> the process?
>>
>> On Mon, Oct 11, 2021 at 6:22 PM Daniel Oliveira <da...@google.com>
>> wrote:
>>
>>> Took me a bit to get to this, sorry. I finally figured out an approach
>>> for updating Go and did so and will be updating the image momentarily.
>>>
>>> I think a more important note is that I tried what Valentyn was
>>> considering, which is SSHing into workers and updating the dependency. I'll
>>> describe the process below, but the summary is that I did it on one worker
>>> with Go so far, saw no problems over the weekend, and would like to
>>> continue updating the rest of the workers if there are no objections.
>>>
>>> Here's a step-by-step of what I did. If we decide to stick with this
>>> approach, these instructions can be added to Confluence:
>>>
>>> 1. Go to the page for the Jenkins agent you want to update [1] and click
>>> "Mark this node temporarily offline", leaving a reason such as "Updating X
>>> dependency."
>>> 2. Wait until there are no more tests running in that agent (under
>>> "Build Executor Status" on the left of the page).
>>> 3. SSH into the agent and perform the update.
>>> 4. Mark the node as online again.
>>> 5. Repeat for every worker.
>>>
>>> And these are some additional steps if you want to immediately run a
>>> test suite to check that the update worked correctly. For example in my
>>> case, I wanted to check against the Go Postcommit, and it was a good thing
>>> I did, because it actually failed the first time and I had to go back in to
>>> fix a small oversight I made. So doing this after you update your first
>>> worker is probably a good idea before updating the rest:
>>>
>>> 1. Go to the page for the job you want to run (for example: [2]).
>>> 2. Click "Configure" on the left menu.
>>> 3. Find the checkmark "Restrict where this project can be run" and
>>> change the restriction from "beam" to the specific name of the agent (ex.
>>> "apache-beam-jenkins-1").
>>> 4. Save and apply that change.
>>> 5. Back on the page for the job, click "Build with Parameters" on the
>>> left menu.
>>> 6. Run the build on "master".
>>> 7. Once you're done checking the results, change the restriction for the
>>> job back to "beam". (This also gets reset once every 24 hours in case you
>>> forget.)
>>>
>>> I did that on one agent (apache-beam-jenkins-2) on Friday evening when
>>> it wasn't too busy, and got Go updated and working. I checked that agent's
>>> execution history again today just in case, and it was healthy over
>>> the weekend, with no Go-related problems as far as I could see. If there are
>>> no objections, I'd like to go ahead and continue updating the rest of the
>>> workers (I'll do this late at night or over the weekend to avoid disrupting
>>> dev work).
>>>
>>> [1] https://ci-beam.apache.org/computer/apache-beam-jenkins-1/
>>> [2] https://ci-beam.apache.org/job/beam_PostCommit_Go/
>>>
>>> On Mon, Oct 4, 2021 at 6:14 PM Valentyn Tymofieiev <va...@google.com>
>>> wrote:
>>>
>>>> I updated the image in [1], but have not yet changed the workers to pick
>>>> up the new image. We can do this once we add Go changes on top of it.
>>>>
>>>> I am also considering SSHing into every worker and running a one-line
>>>> command that adds the dependency that was missing. It seems to be low risk,
>>>> and there is a fallback plan to restart the worker using the saved image:
>>>> both new and old images are saved and available in Cloud Console.
>>>>
>>>> Ideally, we should find a way to do a rolling upgrade that a PMC or
>>>> committer could trigger without logging into every machine.
>>>>
>>>> [1]
>>>> https://issues.apache.org/jira/browse/BEAM-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424228#comment-17424228
>>>>
>>>>
>>>> On Wed, Sep 22, 2021 at 3:28 PM Daniel Oliveira <da...@google.com>
>>>> wrote:
>>>>
>>>>> @Brian Hulette <bh...@google.com> That button seems like exactly
>>>>> what we'd need. Doing it manually would be a pain, but it's probably still
>>>>> preferable to causing a bunch of aborted tests.
>>>>>
>>>>> @Valentyn Tymofieiev <va...@google.com> Collaborating to do both
>>>>> updates at once is a great idea! I'll message you directly about it.
>>>>>
>>>>> On Wed, Sep 22, 2021 at 2:44 PM Valentyn Tymofieiev <
>>>>> valentyn@google.com> wrote:
>>>>>
>>>>>> I am also interested in updating the version of Python on the VMs; I
>>>>>> need to install Python 3.9. Thanks for looking into this. We can
>>>>>> coordinate to make one update instead of two.
>>>>>>
>>>>>> On Wed, Sep 22, 2021 at 2:40 PM Brian Hulette <bh...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I'm not sure about best practices here. Out of curiosity I just
>>>>>>> poked around in the Jenkins UI (e.g. [1]) and it looks like you can
>>>>>>> manually "Mark node temporarily offline" when logged in (if you're a
>>>>>>> committer). According to [2] this will prevent it from picking up new jobs
>>>>>>> after it's finished the currently executing ones. Doing that manually for
>>>>>>> every worker could be a pain though.
>>>>>>>
>>>>>>> Brian
>>>>>>>
>>>>>>> [1] https://ci-beam.apache.org/computer/apache-beam-jenkins-13/
>>>>>>> [2]
>>>>>>> https://stackoverflow.com/questions/26553612/how-do-i-disable-a-node-in-jenkins-ui-after-it-has-completed-its-currently-runni
>>>>>>>
>>>>>>> On Wed, Sep 22, 2021 at 1:03 PM Daniel Oliveira <
>>>>>>> danoliveira@google.com> wrote:
>>>>>>>
>>>>>>>> Hey everyone,
>>>>>>>>
>>>>>>>> I'm aiming at upgrading the version of Go on our Jenkins VMs, and I
>>>>>>>> found these instructions on upgrading software on Jenkins
>>>>>>>> <https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers> on
>>>>>>>> our cwiki.
>>>>>>>>
>>>>>>>> I haven't started going through it yet, but I was wondering about
>>>>>>>> the last few steps that involve stopping VMs, deleting boot disks, and
>>>>>>>> restarting executors. Is there some best practice for that section to avoid
>>>>>>>> causing interruptions in our automated testing? Should I be trying to do
>>>>>>>> this outside of peak dev hours, or going one VM at a time so others can
>>>>>>>> pick up extra load, or anything like that?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Daniel Oliveira
>>>>>>>>
>>>>>>>

Re: Best practices for upgrading installed dependencies on Jenkins VMs?

Posted by Robert Burke <ro...@frantil.com>.
SGTM. Thank you very much Daniel!

On Mon, Oct 11, 2021, 7:51 PM Ahmet Altay <al...@google.com> wrote:

> Thank you Daniel. Could you please update the wiki once you are done with
> the process?
>
> On Mon, Oct 11, 2021 at 6:22 PM Daniel Oliveira <da...@google.com>
> wrote:
>
>> Took me a bit to get to this, sorry. I finally figured out an approach
>> for updating Go and did so and will be updating the image momentarily.
>>
>> I think a more important note is that I tried what Valentyn was
>> considering, which is SSHing into workers and updating the dependency. I'll
>> describe the process below, but the summary is that I did it on one worker
>> with Go so far, saw no problems over the weekend, and would like to
>> continue updating the rest of the workers if there are no objections.
>>
>> Here's a step-by-step of what I did. If we decide to stick with this
>> approach, these instructions can be added to Confluence:
>>
>> 1. Go to the page for the Jenkins agent you want to update [1] and click
>> "Mark this node temporarily offline", leaving a reason such as "Updating X
>> dependency."
>> 2. Wait until there are no more tests running in that agent (under "Build
>> Executor Status" on the left of the page).
>> 3. SSH into the agent and perform the update.
>> 4. Mark the node as online again.
>> 5. Repeat for every worker.
>>
>> And these are some additional steps if you want to immediately run a test
>> suite to check that the update worked correctly. For example in my case, I
>> wanted to check against the Go Postcommit, and it was a good thing I did,
>> because it actually failed the first time and I had to go back in to fix a
>> small oversight I made. So doing this after you update your first worker is
>> probably a good idea before updating the rest:
>>
>> 1. Go to the page for the job you want to run (for example: [2]).
>> 2. Click "Configure" on the left menu.
>> 3. Find the checkmark "Restrict where this project can be run" and change
>> the restriction from "beam" to the specific name of the agent (ex.
>> "apache-beam-jenkins-1").
>> 4. Save and apply that change.
>> 5. Back on the page for the job, click "Build with Parameters" on the
>> left menu.
>> 6. Run the build on "master".
>> 7. Once you're done checking the results, change the restriction for the
>> job back to "beam". (This also gets reset once every 24 hours in case you
>> forget.)
>>
>> I did that on one agent (apache-beam-jenkins-2) on Friday evening when it
>> wasn't too busy, and got Go updated and working. I checked that agent's
>> execution history again today just in case, and it was healthy over
>> the weekend, with no Go-related problems as far as I could see. If there are
>> no objections, I'd like to go ahead and continue updating the rest of the
>> workers (I'll do this late at night or over the weekend to avoid disrupting
>> dev work).
>>
>> [1] https://ci-beam.apache.org/computer/apache-beam-jenkins-1/
>> [2] https://ci-beam.apache.org/job/beam_PostCommit_Go/
>>
>> On Mon, Oct 4, 2021 at 6:14 PM Valentyn Tymofieiev <va...@google.com>
>> wrote:
>>
>>> I updated the image in [1], but did not change the workers to pick up
>>> the new image yet. We can do this once we add Go changes on top of it.
>>>
>>> I am also considering SSHing into every worker and running a one-line
>>> command that adds the dependency that was missing. It seems to be low risk,
>>> and there is a fall-back plan to restart the worker using the saved image:
>>> both new and old images are saved and available in Cloud Console.
>>>
>>> Ideally, we should find a way to do a rolling upgrade that a PMC or
>>> committer could trigger without logging into every machine.
>>>
>>> [1]
>>> https://issues.apache.org/jira/browse/BEAM-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424228#comment-17424228
>>>
>>>
>>> On Wed, Sep 22, 2021 at 3:28 PM Daniel Oliveira <da...@google.com>
>>> wrote:
>>>
>>>> @Brian Hulette <bh...@google.com> That button seems like exactly
>>>> what we'd need. Doing it manually would be a pain, but it's probably still
>>>> preferable to causing a bunch of aborted tests.
>>>>
>>>> @Valentyn Tymofieiev <va...@google.com> Collaborating to do both
>>>> updates at once is a great idea! I'll message you directly about it.
>>>>
>>>> On Wed, Sep 22, 2021 at 2:44 PM Valentyn Tymofieiev <
>>>> valentyn@google.com> wrote:
>>>>
>>>>> I am also interested in updating the version of Python on the VMs; I
>>>>> need to install Python 3.9. Thanks for looking into this. We can
>>>>> coordinate to make one update instead of two.
>>>>>
>>>>> On Wed, Sep 22, 2021 at 2:40 PM Brian Hulette <bh...@google.com>
>>>>> wrote:
>>>>>
>>>>>> I'm not sure about best practices here. Out of curiosity I just poked
>>>>>> around in the Jenkins UI (e.g. [1]) and it looks like you can manually
>>>>>> "Mark node temporarily offline" when logged in (if you're a committer).
>>>>>> According to [2] this will prevent it from picking up new jobs after it's
>>>>>> finished the currently executing ones. Doing that manually for every worker
>>>>>> could be a pain though.
>>>>>>
>>>>>> Brian
>>>>>>
>>>>>> [1] https://ci-beam.apache.org/computer/apache-beam-jenkins-13/
>>>>>> [2]
>>>>>> https://stackoverflow.com/questions/26553612/how-do-i-disable-a-node-in-jenkins-ui-after-it-has-completed-its-currently-runni
>>>>>>
>>>>>> On Wed, Sep 22, 2021 at 1:03 PM Daniel Oliveira <
>>>>>> danoliveira@google.com> wrote:
>>>>>>
>>>>>>> Hey everyone,
>>>>>>>
>>>>>>> I'm aiming at upgrading the version of Go on our Jenkins VMs, and I
>>>>>>> found these instructions on upgrading software on Jenkins
>>>>>>> <https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers> on
>>>>>>> our cwiki.
>>>>>>>
>>>>>>> I haven't started going through it yet, but I was wondering about
>>>>>>> the last few steps that involve stopping VMs, deleting boot disks, and
>>>>>>> restarting executors. Is there some best practice for that section to avoid
>>>>>>> causing interruptions in our automated testing? Should I be trying to do
>>>>>>> this outside of peak dev hours, or going one VM at a time so others can
>>>>>>> pick up extra load, or anything like that?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Daniel Oliveira
>>>>>>>
>>>>>>

Re: Best practices for upgrading installed dependencies on Jenkins VMs?

Posted by Ahmet Altay <al...@google.com>.
Thank you Daniel. Could you please update the wiki once you are done with
the process?

On Mon, Oct 11, 2021 at 6:22 PM Daniel Oliveira <da...@google.com>
wrote:

> Took me a bit to get to this, sorry. I finally figured out an approach for
> updating Go and did so and will be updating the image momentarily.
>
> I think a more important note is that I tried what Valentyn was
> considering, which is SSHing into workers and updating the dependency. I'll
> describe the process below, but the summary is that I did it on one worker
> with Go so far, saw no problems over the weekend, and would like to
> continue updating the rest of the workers if there are no objections.
>
> Here's a step-by-step of what I did. If we decide to stick with this
> approach, these instructions can be added to Confluence:
>
> 1. Go to the page for the Jenkins agent you want to update [1] and click
> "Mark this node temporarily offline", leaving a reason such as "Updating X
> dependency."
> 2. Wait until there are no more tests running in that agent (under "Build
> Executor Status" on the left of the page).
> 3. SSH into the agent and perform the update.
> 4. Mark the node as online again.
> 5. Repeat for every worker.
>
> And these are some additional steps if you want to immediately run a test
> suite to check that the update worked correctly. For example in my case, I
> wanted to check against the Go Postcommit, and it was a good thing I did,
> because it actually failed the first time and I had to go back in to fix a
> small oversight I made. So doing this after you update your first worker is
> probably a good idea before updating the rest:
>
> 1. Go to the page for the job you want to run (for example: [2]).
> 2. Click "Configure" on the left menu.
> 3. Find the checkmark "Restrict where this project can be run" and change
> the restriction from "beam" to the specific name of the agent (ex.
> "apache-beam-jenkins-1").
> 4. Save and apply that change.
> 5. Back on the page for the job, click "Build with Parameters" on the left
> menu.
> 6. Run the build on "master".
> 7. Once you're done checking the results, change the restriction for the
> job back to "beam". (This also gets reset once every 24 hours in case you
> forget.)
>
> I did that on one agent (apache-beam-jenkins-2) on Friday evening when it
> wasn't too busy, and got Go updated and working. I checked that agent's
> execution history again today just in case, and it was healthy over
> the weekend, with no Go-related problems as far as I could see. If there are
> no objections, I'd like to go ahead and continue updating the rest of the
> workers (I'll do this late at night or over the weekend to avoid disrupting
> dev work).
>
> [1] https://ci-beam.apache.org/computer/apache-beam-jenkins-1/
> [2] https://ci-beam.apache.org/job/beam_PostCommit_Go/
>
> On Mon, Oct 4, 2021 at 6:14 PM Valentyn Tymofieiev <va...@google.com>
> wrote:
>
>> I updated the image in [1], but did not change the workers to pick up
>> the new image yet. We can do this once we add Go changes on top of it.
>>
>> I am also considering SSHing into every worker and running a one-line
>> command that adds the dependency that was missing. It seems to be low risk,
>> and there is a fall-back plan to restart the worker using the saved image:
>> both new and old images are saved and available in Cloud Console.
>>
>> Ideally, we should find a way to do a rolling upgrade that a PMC or
>> committer could trigger without logging into every machine.
>>
>> [1]
>> https://issues.apache.org/jira/browse/BEAM-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424228#comment-17424228
>>
>>
>> On Wed, Sep 22, 2021 at 3:28 PM Daniel Oliveira <da...@google.com>
>> wrote:
>>
>>> @Brian Hulette <bh...@google.com> That button seems like exactly
>>> what we'd need. Doing it manually would be a pain, but it's probably still
>>> preferable to causing a bunch of aborted tests.
>>>
>>> @Valentyn Tymofieiev <va...@google.com> Collaborating to do both
>>> updates at once is a great idea! I'll message you directly about it.
>>>
>>> On Wed, Sep 22, 2021 at 2:44 PM Valentyn Tymofieiev <va...@google.com>
>>> wrote:
>>>
>>>> I am also interested in updating the version of Python on the VMs; I
>>>> need to install Python 3.9. Thanks for looking into this. We can
>>>> coordinate to make one update instead of two.
>>>>
>>>> On Wed, Sep 22, 2021 at 2:40 PM Brian Hulette <bh...@google.com>
>>>> wrote:
>>>>
>>>>> I'm not sure about best practices here. Out of curiosity I just poked
>>>>> around in the Jenkins UI (e.g. [1]) and it looks like you can manually
>>>>> "Mark node temporarily offline" when logged in (if you're a committer).
>>>>> According to [2] this will prevent it from picking up new jobs after it's
>>>>> finished the currently executing ones. Doing that manually for every worker
>>>>> could be a pain though.
>>>>>
>>>>> Brian
>>>>>
>>>>> [1] https://ci-beam.apache.org/computer/apache-beam-jenkins-13/
>>>>> [2]
>>>>> https://stackoverflow.com/questions/26553612/how-do-i-disable-a-node-in-jenkins-ui-after-it-has-completed-its-currently-runni
>>>>>
>>>>> On Wed, Sep 22, 2021 at 1:03 PM Daniel Oliveira <
>>>>> danoliveira@google.com> wrote:
>>>>>
>>>>>> Hey everyone,
>>>>>>
>>>>>> I'm aiming at upgrading the version of Go on our Jenkins VMs, and I
>>>>>> found these instructions on upgrading software on Jenkins
>>>>>> <https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers> on
>>>>>> our cwiki.
>>>>>>
>>>>>> I haven't started going through it yet, but I was wondering about the
>>>>>> last few steps that involve stopping VMs, deleting boot disks, and
>>>>>> restarting executors. Is there some best practice for that section to avoid
>>>>>> causing interruptions in our automated testing? Should I be trying to do
>>>>>> this outside of peak dev hours, or going one VM at a time so others can
>>>>>> pick up extra load, or anything like that?
>>>>>>
>>>>>> Thanks,
>>>>>> Daniel Oliveira
>>>>>>
>>>>>

Re: Best practices for upgrading installed dependencies on Jenkins VMs?

Posted by Daniel Oliveira <da...@google.com>.
Sorry, it took me a while to get to this. I finally figured out an approach
for updating Go, applied it, and will be updating the image shortly.

I think a more important note is that I tried what Valentyn was
considering, which is SSHing into workers and updating the dependency. I'll
describe the process below, but the summary is that I did it on one worker
with Go so far, saw no problems over the weekend, and would like to
continue updating the rest of the workers if there are no objections.

Here's a step-by-step of what I did. If we decide to stick with this
approach, these instructions can be added to Confluence:

1. Go to the page for the Jenkins agent you want to update [1] and click
"Mark this node temporarily offline", leaving a reason such as "Updating X
dependency."
2. Wait until there are no more tests running on that agent (under "Build
Executor Status" on the left of the page).
3. SSH into the agent and perform the update.
4. Mark the node as online again.
5. Repeat for every worker.
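
Steps 1 to 5 amount to a drain, update, restore loop over the workers. The
Python sketch below only illustrates that ordering; it is not something that
was run in this thread. The callables set_offline, is_idle, run_update, and
set_online are hypothetical hooks standing in for the Jenkins UI actions and
the SSH session, and the polling interval is an arbitrary choice.

```python
import time
from typing import Callable, Iterable, List


def rolling_update(
    agents: Iterable[str],
    set_offline: Callable[[str, str], None],
    is_idle: Callable[[str], bool],
    run_update: Callable[[str], None],
    set_online: Callable[[str], None],
    reason: str = "Updating X dependency.",
    poll_seconds: float = 30.0,
) -> List[str]:
    """Update agents one at a time so the others keep picking up jobs."""
    updated = []
    for agent in agents:
        set_offline(agent, reason)   # step 1: stop accepting new jobs
        while not is_idle(agent):    # step 2: wait for running tests to finish
            time.sleep(poll_seconds)
        run_update(agent)            # step 3: the update done over SSH
        set_online(agent)            # step 4: accept jobs again
        updated.append(agent)        # step 5: continue with the next agent
    return updated
```

If run_update raises, the exception propagates and that agent stays offline,
which leaves it out of rotation until someone has looked at it.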

Here are some additional steps if you want to immediately run a test suite
to check that the update worked correctly. In my case, I wanted to check
against the Go Postcommit, and it was a good thing I did, because it failed
the first time and I had to go back in to fix a small oversight I made. So
doing this after you update your first worker, before updating the rest, is
probably a good idea:

1. Go to the page for the job you want to run (for example: [2]).
2. Click "Configure" on the left menu.
3. Find the checkbox "Restrict where this project can be run" and change
the restriction from "beam" to the specific name of the agent (e.g.
"apache-beam-jenkins-1").
4. Save and apply that change.
5. Back on the page for the job, click "Build with Parameters" on the left
menu.
6. Run the build on "master".
7. Once you're done checking the results, change the restriction for the
job back to "beam". (This also gets reset once every 24 hours in case you
forget.)
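
The ordering described above (update one worker, run a suite pinned to it,
and only then continue) can be written down as a small gate. This is a sketch
under assumptions, not tooling that exists for these workers: update and
verify are hypothetical callables, where verify would wrap steps 1 to 7, for
example by running the Go Postcommit on the pinned agent and reporting whether
it passed.

```python
from typing import Callable, List, Sequence


def canary_rollout(
    agents: Sequence[str],
    update: Callable[[str], None],
    verify: Callable[[str], bool],
) -> List[str]:
    """Update the first agent, verify it, then update the remaining agents.

    Returns the agents that were updated. If verification fails, only the
    first agent has been touched and the rest are left alone.
    """
    if not agents:
        return []
    canary, rest = agents[0], list(agents[1:])
    update(canary)
    if not verify(canary):
        return [canary]
    for agent in rest:
        update(agent)
    return [canary] + rest
```

A failed check then costs one worker rather than the whole pool, which matches
the experience above where the first Postcommit run failed.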

I did that on one agent (apache-beam-jenkins-2) on Friday evening when it
wasn't too busy, and got Go updated and working. I checked that agent's
execution history again today just in case, and it was healthy over
the weekend, with no Go-related problems as far as I could see. If there are
no objections, I'd like to go ahead and continue updating the rest of the
workers (I'll do this late at night or over the weekend to avoid disrupting
dev work).

[1] https://ci-beam.apache.org/computer/apache-beam-jenkins-1/
[2] https://ci-beam.apache.org/job/beam_PostCommit_Go/

On Mon, Oct 4, 2021 at 6:14 PM Valentyn Tymofieiev <va...@google.com>
wrote:

> I updated the image in [1], but did not change the workers to pick up
> the new image yet. We can do this once we add Go changes on top of it.
>
> I am also considering SSHing into every worker and running a one-line
> command that adds the dependency that was missing. It seems to be low risk,
> and there is a fall-back plan to restart the worker using the saved image:
> both new and old images are saved and available in Cloud Console.
>
> Ideally, we should find a way to do a rolling upgrade that a PMC or
> committer could trigger without logging into every machine.
>
> [1]
> https://issues.apache.org/jira/browse/BEAM-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424228#comment-17424228
>
>
> On Wed, Sep 22, 2021 at 3:28 PM Daniel Oliveira <da...@google.com>
> wrote:
>
>> @Brian Hulette <bh...@google.com> That button seems like exactly what
>> we'd need. Doing it manually would be a pain, but it's probably still
>> preferable to causing a bunch of aborted tests.
>>
>> @Valentyn Tymofieiev <va...@google.com> Collaborating to do both
>> updates at once is a great idea! I'll message you directly about it.
>>
>> On Wed, Sep 22, 2021 at 2:44 PM Valentyn Tymofieiev <va...@google.com>
>> wrote:
>>
>>> I am also interested in updating the version of Python on the VMs; I
>>> need to install Python 3.9. Thanks for looking into this. We can
>>> coordinate to make one update instead of two.
>>>
>>> On Wed, Sep 22, 2021 at 2:40 PM Brian Hulette <bh...@google.com>
>>> wrote:
>>>
>>>> I'm not sure about best practices here. Out of curiosity I just poked
>>>> around in the Jenkins UI (e.g. [1]) and it looks like you can manually
>>>> "Mark node temporarily offline" when logged in (if you're a committer).
>>>> According to [2] this will prevent it from picking up new jobs after it's
>>>> finished the currently executing ones. Doing that manually for every worker
>>>> could be a pain though.
>>>>
>>>> Brian
>>>>
>>>> [1] https://ci-beam.apache.org/computer/apache-beam-jenkins-13/
>>>> [2]
>>>> https://stackoverflow.com/questions/26553612/how-do-i-disable-a-node-in-jenkins-ui-after-it-has-completed-its-currently-runni
>>>>
>>>> On Wed, Sep 22, 2021 at 1:03 PM Daniel Oliveira <da...@google.com>
>>>> wrote:
>>>>
>>>>> Hey everyone,
>>>>>
>>>>> I'm aiming at upgrading the version of Go on our Jenkins VMs, and I
>>>>> found these instructions on upgrading software on Jenkins
>>>>> <https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers> on
>>>>> our cwiki.
>>>>>
>>>>> I haven't started going through it yet, but I was wondering about the
>>>>> last few steps that involve stopping VMs, deleting boot disks, and
>>>>> restarting executors. Is there some best practice for that section to avoid
>>>>> causing interruptions in our automated testing? Should I be trying to do
>>>>> this outside of peak dev hours, or going one VM at a time so others can
>>>>> pick up extra load, or anything like that?
>>>>>
>>>>> Thanks,
>>>>> Daniel Oliveira
>>>>>
>>>>

Re: Best practices for upgrading installed dependencies on Jenkins VMs?

Posted by Valentyn Tymofieiev <va...@google.com>.
I updated the image in [1], but did not change the workers to pick up
the new image yet. We can do this once we add Go changes on top of it.

I am also considering SSHing into every worker and running a one-line command
that adds the dependency that was missing. It seems to be low risk, and
there is a fall-back plan to restart the worker using the saved image:
both new and old images are saved and available in Cloud Console.

Ideally, we should find a way to do a rolling upgrade that a PMC or
committer could trigger without logging into every machine.
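
One way to approximate this without an interactive login on each machine is
to generate the per-worker commands up front. The Python sketch below only
builds argument lists for "gcloud compute ssh ... --command"; it assumes the
workers are Compute Engine instances named like their Jenkins agents, and the
zone, worker count, and remote command are placeholders rather than values
taken from this thread.

```python
from typing import List


def worker_names(count: int, prefix: str = "apache-beam-jenkins-") -> List[str]:
    """Names in the apache-beam-jenkins-N style seen in the agent URLs."""
    return [f"{prefix}{i}" for i in range(1, count + 1)]


def ssh_commands(
    instances: List[str], zone: str, remote_command: str
) -> List[List[str]]:
    """Build one `gcloud compute ssh` invocation per instance.

    Each inner list can be handed to subprocess.run(...) unchanged, which
    avoids local shell quoting problems with the remote command.
    """
    return [
        [
            "gcloud", "compute", "ssh", name,
            f"--zone={zone}",
            f"--command={remote_command}",
        ]
        for name in instances
    ]
```

For example, ssh_commands(worker_names(3), "some-zone", "go version") yields
three commands that could be run one worker at a time, after each worker has
been drained.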

[1]
https://issues.apache.org/jira/browse/BEAM-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424228#comment-17424228


On Wed, Sep 22, 2021 at 3:28 PM Daniel Oliveira <da...@google.com>
wrote:

> @Brian Hulette <bh...@google.com> That button seems like exactly what
> we'd need. Doing it manually would be a pain, but it's probably still
> preferable to causing a bunch of aborted tests.
>
> @Valentyn Tymofieiev <va...@google.com> Collaborating to do both
> updates at once is a great idea! I'll message you directly about it.
>
> On Wed, Sep 22, 2021 at 2:44 PM Valentyn Tymofieiev <va...@google.com>
> wrote:
>
>> I am also interested in updating the version of Python on the VMs; I need
>> to install Python 3.9. Thanks for looking into this. We can coordinate
>> to make one update instead of two.
>>
>> On Wed, Sep 22, 2021 at 2:40 PM Brian Hulette <bh...@google.com>
>> wrote:
>>
>>> I'm not sure about best practices here. Out of curiosity I just poked
>>> around in the Jenkins UI (e.g. [1]) and it looks like you can manually
>>> "Mark node temporarily offline" when logged in (if you're a committer).
>>> According to [2] this will prevent it from picking up new jobs after it's
>>> finished the currently executing ones. Doing that manually for every worker
>>> could be a pain though.
>>>
>>> Brian
>>>
>>> [1] https://ci-beam.apache.org/computer/apache-beam-jenkins-13/
>>> [2]
>>> https://stackoverflow.com/questions/26553612/how-do-i-disable-a-node-in-jenkins-ui-after-it-has-completed-its-currently-runni
>>>
>>> On Wed, Sep 22, 2021 at 1:03 PM Daniel Oliveira <da...@google.com>
>>> wrote:
>>>
>>>> Hey everyone,
>>>>
>>>> I'm aiming at upgrading the version of Go on our Jenkins VMs, and I
>>>> found these instructions on upgrading software on Jenkins
>>>> <https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers> on
>>>> our cwiki.
>>>>
>>>> I haven't started going through it yet, but I was wondering about the
>>>> last few steps that involve stopping VMs, deleting boot disks, and
>>>> restarting executors. Is there some best practice for that section to avoid
>>>> causing interruptions in our automated testing? Should I be trying to do
>>>> this outside of peak dev hours, or going one VM at a time so others can
>>>> pick up extra load, or anything like that?
>>>>
>>>> Thanks,
>>>> Daniel Oliveira
>>>>
>>>

Re: Best practices for upgrading installed dependencies on Jenkins VMs?

Posted by Daniel Oliveira <da...@google.com>.
@Brian Hulette <bh...@google.com> That button seems like exactly what
we'd need. Doing it manually would be a pain, but it's probably still
preferable to causing a bunch of aborted tests.

@Valentyn Tymofieiev <va...@google.com> Collaborating to do both updates
at once is a great idea! I'll message you directly about it.

On Wed, Sep 22, 2021 at 2:44 PM Valentyn Tymofieiev <va...@google.com>
wrote:

> I am also interested in updating the version of Python on the VMs; I need
> to install Python 3.9. Thanks for looking into this. We can coordinate
> to make one update instead of two.
>
> On Wed, Sep 22, 2021 at 2:40 PM Brian Hulette <bh...@google.com> wrote:
>
>> I'm not sure about best practices here. Out of curiosity I just poked
>> around in the Jenkins UI (e.g. [1]) and it looks like you can manually
>> "Mark node temporarily offline" when logged in (if you're a committer).
>> According to [2] this will prevent it from picking up new jobs after it's
>> finished the currently executing ones. Doing that manually for every worker
>> could be a pain though.
>>
>> Brian
>>
>> [1] https://ci-beam.apache.org/computer/apache-beam-jenkins-13/
>> [2]
>> https://stackoverflow.com/questions/26553612/how-do-i-disable-a-node-in-jenkins-ui-after-it-has-completed-its-currently-runni
>>
>> On Wed, Sep 22, 2021 at 1:03 PM Daniel Oliveira <da...@google.com>
>> wrote:
>>
>>> Hey everyone,
>>>
>>> I'm aiming at upgrading the version of Go on our Jenkins VMs, and I
>>> found these instructions on upgrading software on Jenkins
>>> <https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers> on
>>> our cwiki.
>>>
>>> I haven't started going through it yet, but I was wondering about the
>>> last few steps that involve stopping VMs, deleting boot disks, and
>>> restarting executors. Is there some best practice for that section to avoid
>>> causing interruptions in our automated testing? Should I be trying to do
>>> this outside of peak dev hours, or going one VM at a time so others can
>>> pick up extra load, or anything like that?
>>>
>>> Thanks,
>>> Daniel Oliveira
>>>
>>

Re: Best practices for upgrading installed dependencies on Jenkins VMs?

Posted by Valentyn Tymofieiev <va...@google.com>.
I am also interested in updating the version of Python on the VMs; I need to
install Python 3.9. Thanks for looking into this. We can coordinate
to make one update instead of two.

On Wed, Sep 22, 2021 at 2:40 PM Brian Hulette <bh...@google.com> wrote:

> I'm not sure about best practices here. Out of curiosity I just poked
> around in the Jenkins UI (e.g. [1]) and it looks like you can manually
> "Mark node temporarily offline" when logged in (if you're a committer).
> According to [2] this will prevent it from picking up new jobs after it's
> finished the currently executing ones. Doing that manually for every worker
> could be a pain though.
>
> Brian
>
> [1] https://ci-beam.apache.org/computer/apache-beam-jenkins-13/
> [2]
> https://stackoverflow.com/questions/26553612/how-do-i-disable-a-node-in-jenkins-ui-after-it-has-completed-its-currently-runni
>
> On Wed, Sep 22, 2021 at 1:03 PM Daniel Oliveira <da...@google.com>
> wrote:
>
>> Hey everyone,
>>
>> I'm aiming at upgrading the version of Go on our Jenkins VMs, and I found these
>> instructions on upgrading software on Jenkins
>> <https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers> on
>> our cwiki.
>>
>> I haven't started going through it yet, but I was wondering about the
>> last few steps that involve stopping VMs, deleting boot disks, and
>> restarting executors. Is there some best practice for that section to avoid
>> causing interruptions in our automated testing? Should I be trying to do
>> this outside of peak dev hours, or going one VM at a time so others can
>> pick up extra load, or anything like that?
>>
>> Thanks,
>> Daniel Oliveira
>>
>

Re: Best practices for upgrading installed dependencies on Jenkins VMs?

Posted by Brian Hulette <bh...@google.com>.
I'm not sure about best practices here. Out of curiosity I just poked
around in the Jenkins UI (e.g. [1]) and it looks like you can manually
"Mark node temporarily offline" when logged in (if you're a committer).
According to [2] this will prevent it from picking up new jobs after it's
finished the currently executing ones. Doing that manually for every worker
could be a pain though.
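
The per-worker clicking could in principle be scripted against Jenkins' HTTP
endpoints. The Python sketch below only builds the URLs and decides whether a
call is needed; it assumes the standard /computer/<name>/api/json status
endpoint and the /computer/<name>/toggleOffline POST endpoint are reachable on
this instance, and it leaves out authentication and CSRF crumbs entirely.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://ci-beam.apache.org"


def status_url(agent: str, base_url: str = BASE_URL) -> str:
    """JSON status for one agent (includes `idle` and `temporarilyOffline`)."""
    return f"{base_url}/computer/{quote(agent)}/api/json"


def toggle_offline_url(agent: str, reason: str, base_url: str = BASE_URL) -> str:
    """POST target that flips an agent between online and temporarily offline."""
    query = urlencode({"offlineMessage": reason})
    return f"{base_url}/computer/{quote(agent)}/toggleOffline?{query}"


def needs_toggle(status: dict, want_offline: bool) -> bool:
    """The endpoint is a toggle, so only call it when the state must change."""
    return bool(status.get("temporarilyOffline", False)) != want_offline
```

Because the endpoint toggles rather than sets the state, checking the status
first avoids accidentally bringing a drained worker back online.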

Brian

[1] https://ci-beam.apache.org/computer/apache-beam-jenkins-13/
[2]
https://stackoverflow.com/questions/26553612/how-do-i-disable-a-node-in-jenkins-ui-after-it-has-completed-its-currently-runni

On Wed, Sep 22, 2021 at 1:03 PM Daniel Oliveira <da...@google.com>
wrote:

> Hey everyone,
>
> I'm aiming at upgrading the version of Go on our Jenkins VMs, and I found these
> instructions on upgrading software on Jenkins
> <https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips#JenkinsTips-HowtoinstallandupgradesoftwareonJenkinsworkers> on
> our cwiki.
>
> I haven't started going through it yet, but I was wondering about the last
> few steps that involve stopping VMs, deleting boot disks, and restarting
> executors. Is there some best practice for that section to avoid causing
> interruptions in our automated testing? Should I be trying to do this
> outside of peak dev hours, or going one VM at a time so others can pick up
> extra load, or anything like that?
>
> Thanks,
> Daniel Oliveira
>