Posted to dev@beam.apache.org by Yifan Zou <yi...@google.com> on 2019/07/01 17:00:53 UTC
Re: apache-beam-jenkins-15 out of disk
https://issues.apache.org/jira/browse/BEAM-7650 tracks the docker issue.
On Sun, Jun 30, 2019 at 2:35 PM Mark Liu <ma...@google.com> wrote:
> Thank you for triaging and working out a solution, Yifan and Ankur.
>
> Ankur, from what you discovered, we should fix this race condition;
> otherwise the same problem will happen again in the future. Is there a
> JIRA tracking this issue?
>
> On Fri, Jun 28, 2019 at 4:56 PM Yifan Zou <yi...@google.com> wrote:
>
>> Sorry for the inconvenience. I disabled the worker. I'll need more time
>> to restore it.
>>
>> On Fri, Jun 28, 2019 at 3:56 PM Daniel Oliveira <da...@google.com>
>> wrote:
>>
>>> Any updates on this issue today? It seems like this (or a similar bug)
>>> is still happening across many Precommits and Postcommits.
>>>
>>> On Fri, Jun 28, 2019 at 12:33 AM Yifan Zou <yi...@google.com> wrote:
>>>
>>>> I did the prune on beam15. The disk was freed, but all jobs fail with
>>>> other weird problems. It looks like the docker prune was too aggressive,
>>>> but I don't have evidence. Will look further in the AM.
>>>>
>>>> On Thu, Jun 27, 2019 at 11:20 PM Udi Meiri <eh...@google.com> wrote:
>>>>
>>>>> See how the hdfs IT already avoids tag collisions.
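>>>>>
>>>>> The dump quoted below shows the pattern: the HDFS IT bakes the Jenkins
>>>>> job name and build number into each image name, e.g.
>>>>> hdfs_it-jenkins-beam_postcommit_python_verify-8590_test. A minimal
>>>>> sketch of the same idea for the beam/python images, assuming Jenkins'
>>>>> BUILD_TAG variable is available (the build command here is
>>>>> illustrative, not the tests' actual script):
>>>>>
>>>>> # Give every build its own tag so concurrent builds cannot overwrite
>>>>> # each other's image. Jenkins sets BUILD_TAG to
>>>>> # jenkins-<job name>-<build number>.
>>>>> TAG="${BUILD_TAG:-local-$(date +%s)}"
>>>>> docker build -t "jenkins-docker-apache.bintray.io/beam/python:${TAG}" .
>>>>> # ... run the portable tests against that exact tag ...
>>>>> docker rmi "jenkins-docker-apache.bintray.io/beam/python:${TAG}"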
>>>>>
>>>>> On Thu, Jun 27, 2019, 20:42 Yichi Zhang <zy...@google.com> wrote:
>>>>>
>>>>>> For the flakiness, I guess a unique tag is needed to keep concurrent
>>>>>> builds apart.
>>>>>>
>>>>>> On Thu, Jun 27, 2019 at 8:39 PM Yichi Zhang <zy...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> maybe a cron job on jenkins node that does docker prune every day?
>>>>>>>
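>>>>>>> A sketch of what that could look like, assuming a cron.d entry on
>>>>>>> each node (the schedule and prune flags are one reasonable choice,
>>>>>>> not a tested setup):
>>>>>>>
>>>>>>> # /etc/cron.d/docker-prune (hypothetical): every day at 03:00,
>>>>>>> # remove stopped containers and any image unused for more than 24h;
>>>>>>> # images still used by a running build's containers are kept.
>>>>>>> 0 3 * * * jenkins docker system prune --all --force --filter "until=24h"
>>>>>>>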
>>>>>>> On Thu, Jun 27, 2019 at 6:58 PM Ankur Goenka <go...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> This highlights the race condition caused by using a single docker
>>>>>>>> registry on a machine.
>>>>>>>> If 2 tests create "jenkins-docker-apache.bintray.io/beam/python" one
>>>>>>>> after another, then the 2nd one will replace the 1st one and cause
>>>>>>>> flakiness.
>>>>>>>>
>>>>>>>> Is there a way to dynamically create and destroy a docker registry
>>>>>>>> on a machine and clean all the relevant data?
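>>>>>>>>
>>>>>>>> One possible shape for that, sketched with the stock registry:2
>>>>>>>> image (the names and port handling are made up, and this would
>>>>>>>> still need wiring into the test scripts):
>>>>>>>>
>>>>>>>> REG_PORT=5000  # would have to be unique per concurrent run
>>>>>>>> # Throwaway per-build registry, backed by a run-specific volume.
>>>>>>>> docker run -d --name "reg-${BUILD_NUMBER}" \
>>>>>>>>   -p "${REG_PORT}:5000" \
>>>>>>>>   -v "reg-data-${BUILD_NUMBER}:/var/lib/registry" \
>>>>>>>>   registry:2
>>>>>>>> # ... tag and push this run's images via localhost:${REG_PORT} ...
>>>>>>>> # Tear down the registry and all of its data when the run ends.
>>>>>>>> docker rm -f "reg-${BUILD_NUMBER}"
>>>>>>>> docker volume rm "reg-data-${BUILD_NUMBER}"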
>>>>>>>>
>>>>>>>> On Thu, Jun 27, 2019 at 3:15 PM Yifan Zou <yi...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> The problem was caused by the large quantity of stale docker
>>>>>>>>> images generated by the Python portable tests and the HDFS IT.
>>>>>>>>>
>>>>>>>>> Dumping the docker disk usage gives me:
>>>>>>>>>
>>>>>>>>> TYPE            TOTAL   ACTIVE  SIZE      RECLAIMABLE
>>>>>>>>> *Images         1039    356     424GB     384.2GB (90%)*
>>>>>>>>> Containers      987     2       2.042GB   2.041GB (99%)
>>>>>>>>> Local Volumes   126     0       392.8MB   392.8MB (100%)
>>>>>>>>>
>>>>>>>>> REPOSITORY                                                 TAG     IMAGE ID      CREATED       SIZE     SHARED SIZE  UNIQUE SIZE  CONTAINERS
>>>>>>>>> jenkins-docker-apache.bintray.io/beam/python3              latest  ff1b949f4442  22 hours ago  1.639GB  922.3MB      716.9MB      0
>>>>>>>>> jenkins-docker-apache.bintray.io/beam/python               latest  1dda7b9d9748  22 hours ago  1.624GB  913.7MB      710.3MB      0
>>>>>>>>> <none>                                                     <none>  05458187a0e3  22 hours ago  732.9MB  625.1MB      107.8MB      4
>>>>>>>>> <none>                                                     <none>  896f35dd685f  23 hours ago  1.639GB  922.3MB      716.9MB      0
>>>>>>>>> <none>                                                     <none>  db4d24ca9f2b  23 hours ago  1.624GB  913.7MB      710.3MB      0
>>>>>>>>> <none>                                                     <none>  547df4d71c31  23 hours ago  732.9MB  625.1MB      107.8MB      4
>>>>>>>>> <none>                                                     <none>  dd7d9582c3e0  23 hours ago  1.639GB  922.3MB      716.9MB      0
>>>>>>>>> <none>                                                     <none>  664aae255239  23 hours ago  1.624GB  913.7MB      710.3MB      0
>>>>>>>>> <none>                                                     <none>  b528fedf9228  23 hours ago  732.9MB  625.1MB      107.8MB      4
>>>>>>>>> <none>                                                     <none>  8e996f22435e  25 hours ago  1.624GB  913.7MB      710.3MB      0
>>>>>>>>> hdfs_it-jenkins-beam_postcommit_python_verify_pr-818_test  latest  24b73b3fec06  25 hours ago  1.305GB  965.7MB      339.5MB      0
>>>>>>>>> <none>                                                     <none>  096325fb48de  25 hours ago  732.9MB  625.1MB      107.8MB      2
>>>>>>>>> jenkins-docker-apache.bintray.io/beam/java                 latest  c36d8ff2945d  25 hours ago  685.6MB  625.1MB      60.52MB      0
>>>>>>>>> <none>                                                     <none>  11c86ebe025f  26 hours ago  1.639GB  922.3MB      716.9MB      0
>>>>>>>>> <none>                                                     <none>  2ecd69c89ec1  26 hours ago  1.624GB  913.7MB      710.3MB      0
>>>>>>>>> hdfs_it-jenkins-beam_postcommit_python_verify-8590_test    latest  3d1d589d44fe  2 days ago    1.305GB  965.7MB      339.5MB      0
>>>>>>>>> hdfs_it-jenkins-beam_postcommit_python_verify_pr-801_test  latest  d1cc503ebe8e  2 days ago    1.305GB  965.7MB      339.2MB      0
>>>>>>>>> hdfs_it-jenkins-beam_postcommit_python_verify-8577_test    latest  8582c6ca6e15  3 days ago    1.305GB  965.7MB      339.2MB      0
>>>>>>>>> hdfs_it-jenkins-beam_postcommit_python_verify-8576_test    latest  4591e0948170  3 days ago    1.305GB  965.7MB      339.2MB      0
>>>>>>>>> hdfs_it-jenkins-beam_postcommit_python_verify-8575_test    latest  ab181c49d56e  4 days ago    1.305GB  965.7MB      339.2MB      0
>>>>>>>>> hdfs_it-jenkins-beam_postcommit_python_verify-8573_test    latest  2104ba0a6db7  4 days ago    1.305GB  965.7MB      339.2MB      0
>>>>>>>>> ...
>>>>>>>>> <1000+ images>
>>>>>>>>>
>>>>>>>>> I removed the unused images and beam15 is back now.
>>>>>>>>>
>>>>>>>>> Opened https://issues.apache.org/jira/browse/BEAM-7650.
>>>>>>>>> Ankur, I assigned the issue to you. Feel free to reassign it if
>>>>>>>>> needed.
>>>>>>>>>
>>>>>>>>> Thank you.
>>>>>>>>> Yifan
>>>>>>>>>
>>>>>>>>> On Thu, Jun 27, 2019 at 11:29 AM Yifan Zou <yi...@google.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Something was eating the disk. I disconnected the worker so jobs
>>>>>>>>>> could be allocated to other nodes. Will look deeper.
>>>>>>>>>> Filesystem  Size  Used  Avail  Use%  Mounted on
>>>>>>>>>> /dev/sda1   485G  485G    96K  100%  /
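>>>>>>>>>>
>>>>>>>>>> One generic way to narrow down what is eating the space (a
>>>>>>>>>> sketch; not necessarily the exact commands used here):
>>>>>>>>>>
>>>>>>>>>> # Largest top-level directories on the root filesystem, then
>>>>>>>>>> # docker's own per-image/container/volume accounting.
>>>>>>>>>> sudo du -xh --max-depth=1 / 2>/dev/null | sort -rh | head
>>>>>>>>>> docker system df -v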
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Jun 27, 2019 at 10:54 AM Yifan Zou <yi...@google.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I'm on it.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jun 27, 2019 at 10:17 AM Udi Meiri <eh...@google.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Opened a bug here:
>>>>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-7648
>>>>>>>>>>>>
>>>>>>>>>>>> Can someone investigate what's going on?
>>>>>>>>>>>>
>>>>>>>>>>>
Re: apache-beam-jenkins-15 out of disk
Posted by Yifan Zou <yi...@google.com>.
I reimaged beam15 and re-enabled the worker. Let us know if anything
weird happens on any agent.
Thanks.
Yifan
On Mon, Jul 1, 2019 at 10:00 AM Yifan Zou <yi...@google.com> wrote:
> https://issues.apache.org/jira/browse/BEAM-7650 tracks the docker issue.