Posted to issues@mesos.apache.org by "Charles Natali (Jira)" <ji...@apache.org> on 2021/08/02 17:49:00 UTC

[jira] [Commented] (MESOS-10226) test suite hangs on ARM64

    [ https://issues.apache.org/jira/browse/MESOS-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17391736#comment-17391736 ] 

Charles Natali commented on MESOS-10226:
----------------------------------------

Hm, it's annoying: the gdb backtrace you posted shows that the regtest gets stuck in this test, but for some reason running this test on its own isn't enough to reproduce the hang.
It's going to be very difficult to debug without being able to run the tests myself.

> test suite hangs on ARM64
> -------------------------
>
>                 Key: MESOS-10226
>                 URL: https://issues.apache.org/jira/browse/MESOS-10226
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Charles Natali
>            Assignee: Charles Natali
>            Priority: Major
>         Attachments: gdb-thread-apply-bt-all-29.07.2021-2.txt, gdb-thread-apply-bt-all-29.07.2021.txt
>
>
> Reported by [~mgrigorov].
>  
> {noformat}
> [ RUN      ] NestedMesosContainerizerTest.ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace
> sh: 1: hadoop: not found
> Marked '/' as rslave
> I0726 11:59:17.812630    32 exec.cpp:164] Version: 1.12.0
> I0726 11:59:17.827512    31 exec.cpp:237] Executor registered on agent 9076f44b-846d-4f00-a2dc-11f694cc1900-S0
> I0726 11:59:17.830999    36 executor.cpp:190] Received SUBSCRIBED event
> I0726 11:59:17.832351    36 executor.cpp:194] Subscribed executor on martin-arm64
> I0726 11:59:17.832775    36 executor.cpp:190] Received LAUNCH event
> I0726 11:59:17.834415    36 executor.cpp:722] Starting task d1bbb266-bee7-4c9d-929f-16aa41f4e9cf
> I0726 11:59:17.839910    36 executor.cpp:740] Forked command at 38
> Preparing rootfs at '/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_1bL0mz/provisioner/containers/e8553a7c-145d-47a4-afd6-3a6cf326cd48/backends/overlay/rootfses/6a62b0ce-df7b-4bab-bf7c-633d9f860791'
> Changing root to /tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_1bL0mz/provisioner/containers/e8553a7c-145d-47a4-afd6-3a6cf326cd48/backends/overlay/rootfses/6a62b0ce-df7b-4bab-bf7c-633d9f860791
> Failed to execute 'sh': Exec format error
> I0726 11:59:18.113488    33 executor.cpp:1041] Command exited with status 1 (pid: 38)
> ../../src/tests/containerizer/nested_mesos_containerizer_tests.cpp:1111: Failure
> Mock function called more times than expected - returning directly.
>     Function call: statusUpdate(0xffffc28527f0, @0xffffa2cf3a60 136-byte object <08-05 6C-B6 FF-FF 00-00 00-00 00-00 00-00 00-00 BE-A8 00-00 00-00 00-00 A8-F6 C0-B6 FF-FF 00-00 D0-04 05-94 FF-FF 00-00 A0-E6 04-94 FF-FF 00-00 A0-F1 05-94 FF-FF 00-00 60-78 04-94 FF-FF 00-00 ... 00-00 00-00 00-00 00-00 20-BD 01-78 FF-FF 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 20-5D 87-61 A5-3F D8-41 00-00 00-00 02-00 00-00 00-00 00-00 03-00 00-00>)
>          Expected: to be called twice
>            Actual: called 3 times - over-saturated and active
> I0726 11:59:19.117401    37 process.cpp:935] Stopped the socket accept loop{noformat}
>  
> I asked him to provide a gdb traceback and we can see the following:
>  
> {noformat}
> Thread 1 (Thread 0xffffa3bc2c60 (LWP 173475)):
> #0 0x0000ffffa518db20 in __libc_open64 (file=0xaaab00f342e0 "/tmp/7VXP3w/pipe", oflag=<optimized out>) at ../sysdeps/unix/sysv/linux/open64.c:48
> #1 0x0000ffffa513adb0 in __GI__IO_file_open (fp=fp@entry=0xaaab00e439a0, filename=<optimized out>, posix_mode=<optimized out>, prot=prot@entry=438, read_write=8, is32not64=<optimized out>) at fileops.c:189
> #2 0x0000ffffa513b0b0 in _IO_new_file_fopen (fp=fp@entry=0xaaab00e439a0, filename=filename@entry=0xaaab00f342e0 "/tmp/7VXP3w/pipe", mode=<optimized out>, mode@entry=0xaaaad762f3c8 "r", is32not64=is32not64@entry=1) at fileops.c:281
> #3 0x0000ffffa512e0dc in __fopen_internal (filename=0xaaab00f342e0 "/tmp/7VXP3w/pipe", mode=0xaaaad762f3c8 "r", is32=1) at iofopen.c:75
> #4 0x0000aaaad54f5350 in os::read (path="/tmp/7VXP3w/pipe") at ../../3rdparty/stout/include/stout/os/read.hpp:136
> #5 0x0000aaaad74f1c1c in mesos::internal::tests::NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_Test::TestBody (this=0xaaab00f88f50) at ../../src/tests/containerizer/nested_mesos_containerizer_tests.cpp:1126
> {noformat}
>  
>  
> Basically the test uses a named pipe to synchronize with the task being started, and if the task fails to start - in this case because we're trying to launch an x86 container on an arm64 host - the test will just hang reading from the pipe.
> I sent Martin a tentative fix for him to test, and I'll open an MR if successful.
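
For illustration, here is one way a test could avoid blocking forever on such a pipe. This is only a minimal sketch under stated assumptions: `readFifoWithTimeout` is a hypothetical helper, not part of Mesos or stout, and it is not necessarily the tentative fix mentioned above. It assumes Linux FIFO semantics, where a non-blocking open of the read end succeeds without a writer and poll() then waits until a writer shows up or the timeout expires.

```cpp
#include <fcntl.h>
#include <poll.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>
#include <chrono>
#include <string>

// Hypothetical helper (not part of Mesos or stout): reads everything a writer
// sends through the FIFO at `path` into `*out`. Returns false instead of
// blocking forever if no writer finishes within `timeout`.
bool readFifoWithTimeout(
    const std::string& path,
    std::chrono::milliseconds timeout,
    std::string* out)
{
  assert(out != nullptr);

  // With O_NONBLOCK, opening the read end returns immediately even when no
  // writer exists. A blocking open (which is what fopen() performs) would
  // wait here until a writer opens the pipe.
  int fd = ::open(path.c_str(), O_RDONLY | O_NONBLOCK);
  if (fd < 0) {
    return false;
  }

  // Only handle FIFOs; anything else is treated as an error.
  struct stat s;
  if (::fstat(fd, &s) != 0 || !S_ISFIFO(s.st_mode)) {
    ::close(fd);
    return false;
  }

  const auto deadline = std::chrono::steady_clock::now() + timeout;
  std::string data;
  char buffer[4096];

  while (true) {
    const auto remaining =
      std::chrono::duration_cast<std::chrono::milliseconds>(
          deadline - std::chrono::steady_clock::now());
    if (remaining.count() <= 0) {
      break; // Timed out.
    }

    struct pollfd pfd;
    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    int ready = ::poll(&pfd, 1, static_cast<int>(remaining.count()));
    if (ready < 0 && errno == EINTR) {
      continue; // Interrupted by a signal; retry with the remaining time.
    }
    if (ready <= 0) {
      break; // Timeout (0) or poll error (< 0).
    }

    ssize_t n = ::read(fd, buffer, sizeof(buffer));
    if (n > 0) {
      data.append(buffer, static_cast<size_t>(n));
    } else if (n == 0) {
      // End of file: the writer closed its end, so we have everything.
      ::close(fd);
      *out = data;
      return true;
    } else if (errno != EAGAIN && errno != EINTR) {
      break; // Unexpected read error.
    }
  }

  ::close(fd);
  return false;
}
```

With a helper along these lines, the test could fail with a clear message when the task never writes to the pipe (for example because the container image does not match the host architecture), rather than hanging the whole test suite.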



--
This message was sent by Atlassian Jira
(v8.3.4#803005)