You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by Aaron Wood <aa...@verizon.com> on 2017/01/04 00:26:48 UTC

Re: Review Request 54996: Fix SIGBUS crash on ARM64/AArch64.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54996/
-----------------------------------------------------------

(Updated Jan. 4, 2017, 12:26 a.m.)


Review request for mesos and Jie Yu.


Changes
-------

Created stack class so that the logic for creating an aligned stack can be shared. Used this new class in another area of the codebase where a stack was being allocated.


Bugs: MESOS-6835
    https://issues.apache.org/jira/browse/MESOS-6835


Repository: mesos


Description
-------

Currently in the Linux launcher when the stack is allocated and prepared for a call to clone() it is not properly aligned. This is not an issue for x86 or x64 but for ARM64/AArch64 it is because of the requirement of having the stack aligned to a 16 byte boundary. While x86 and x64 also expect the stack to have a 16 byte aligned stack, it is not enforced. An explanation of the stack and requirements for ARM64 can be found here http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf (specifically section 5.2.2.1 that says SP mod 16 = 0. The stack must be quad-word aligned.)

Additionally, the way that the stack is currently allocated and passed to clone() accidentally chops off one entry, making a stack overflow using those missing 8 bytes a possibility. Fixing this while aligning the memory will fix both the issue of the stack overflow issue as well as the SIGBUS crash. We should also net better performance from having the stack aligned.


Diffs (updated)
-----

  3rdparty/stout/include/stout/os/linux.hpp 530f1a55b 
  src/linux/ns.hpp 77789717e 

Diff: https://reviews.apache.org/r/54996/diff/


Testing
-------

Built Mesos from source and am currently running it in a test cluster. Launched both Docker and Mesos tasks via Marathon without any resulting crash (initial crash only happened with Mesos containerizer + linux_launcher, not with the posix_launcher).


Thanks,

Aaron Wood


Re: Review Request 54996: Fix SIGBUS crash on ARM64/AArch64.

Posted by Aaron Wood <aa...@verizon.com>.

> On Jan. 4, 2017, 5:58 p.m., Jie Yu wrote:
> > 3rdparty/stout/include/stout/os/linux.hpp, lines 57-60
> > <https://reviews.apache.org/r/54996/diff/2/?file=1596476#file1596476line57>
> >
> >     I would actually suggest keep this `create` method, but move the allocation logic in `create`. 
> >     
> >     ```
> >     static Try<Stack> create(size_t size)
> >     {
> >       Stack stack(size);
> >       
> >       if (posix_memalign(...) != 0) {
> >         return ErrnoError("Failed to allocate stack");
> >       }
> >       
> >       return stack;
> >     }
> >     ```
> >     
> >     We can get rid of the `allocate` function. Once created, it's by default allocated.

`address` and `size` would need to be static for this. That opens up another issue since all stacks would share the same data. Would it be better to get rid of the private constructor, delete the copy constructor and assignment, and just have `allocate()` and `deallocate()`?


- Aaron


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54996/#review160513
-----------------------------------------------------------


On Jan. 4, 2017, 12:26 a.m., Aaron Wood wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54996/
> -----------------------------------------------------------
> 
> (Updated Jan. 4, 2017, 12:26 a.m.)
> 
> 
> Review request for mesos and Jie Yu.
> 
> 
> Bugs: MESOS-6835
>     https://issues.apache.org/jira/browse/MESOS-6835
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently in the Linux launcher when the stack is allocated and prepared for a call to clone() it is not properly aligned. This is not an issue for x86 or x64 but for ARM64/AArch64 it is because of the requirement of having the stack aligned to a 16 byte boundary. While x86 and x64 also expect the stack to have a 16 byte aligned stack, it is not enforced. An explanation of the stack and requirements for ARM64 can be found here http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf (specifically section 5.2.2.1 that says SP mod 16 = 0. The stack must be quad-word aligned.)
> 
> Additionally, the way that the stack is currently allocated and passed to clone() accidentally chops off one entry, making a stack overflow using those missing 8 bytes a possibility. Fixing this while aligning the memory will fix both the issue of the stack overflow issue as well as the SIGBUS crash. We should also net better performance from having the stack aligned.
> 
> 
> Diffs
> -----
> 
>   3rdparty/stout/include/stout/os/linux.hpp 530f1a55b 
>   src/linux/ns.hpp 77789717e 
> 
> Diff: https://reviews.apache.org/r/54996/diff/
> 
> 
> Testing
> -------
> 
> Built Mesos from source and am currently running it in a test cluster. Launched both Docker and Mesos tasks via Marathon without any resulting crash (initial crash only happened with Mesos containerizer + linux_launcher, not with the posix_launcher).
> 
> 
> Thanks,
> 
> Aaron Wood
> 
>


Re: Review Request 54996: Fix SIGBUS crash on ARM64/AArch64.

Posted by Jie Yu <yu...@gmail.com>.

> On Jan. 4, 2017, 5:58 p.m., Jie Yu wrote:
> > 3rdparty/stout/include/stout/os/linux.hpp, lines 57-60
> > <https://reviews.apache.org/r/54996/diff/2/?file=1596476#file1596476line57>
> >
> >     I would actually suggest keep this `create` method, but move the allocation logic in `create`. 
> >     
> >     ```
> >     static Try<Stack> create(size_t size)
> >     {
> >       Stack stack(size);
> >       
> >       if (posix_memalign(...) != 0) {
> >         return ErrnoError("Failed to allocate stack");
> >       }
> >       
> >       return stack;
> >     }
> >     ```
> >     
> >     We can get rid of the `allocate` function. Once created, it's by default allocated.
> 
> Aaron Wood wrote:
>     `address` and `size` would need to be static for this. That opens up another issue since all stacks would share the same data. Would it be better to get rid of the private constructor, delete the copy constructor and assignment, and just have `allocate()` and `deallocate()`?

Why? you can do `stack.address` and `stack.size`.  constructor is private, meaning that only 'create' can construct the Stack. Removing copy and assignment operator sounds good. Can you try that see if that compiles?


- Jie


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54996/#review160513
-----------------------------------------------------------


On Jan. 4, 2017, 12:26 a.m., Aaron Wood wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54996/
> -----------------------------------------------------------
> 
> (Updated Jan. 4, 2017, 12:26 a.m.)
> 
> 
> Review request for mesos and Jie Yu.
> 
> 
> Bugs: MESOS-6835
>     https://issues.apache.org/jira/browse/MESOS-6835
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently in the Linux launcher when the stack is allocated and prepared for a call to clone() it is not properly aligned. This is not an issue for x86 or x64 but for ARM64/AArch64 it is because of the requirement of having the stack aligned to a 16 byte boundary. While x86 and x64 also expect the stack to have a 16 byte aligned stack, it is not enforced. An explanation of the stack and requirements for ARM64 can be found here http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf (specifically section 5.2.2.1 that says SP mod 16 = 0. The stack must be quad-word aligned.)
> 
> Additionally, the way that the stack is currently allocated and passed to clone() accidentally chops off one entry, making a stack overflow using those missing 8 bytes a possibility. Fixing this while aligning the memory will fix both the issue of the stack overflow issue as well as the SIGBUS crash. We should also net better performance from having the stack aligned.
> 
> 
> Diffs
> -----
> 
>   3rdparty/stout/include/stout/os/linux.hpp 530f1a55b 
>   src/linux/ns.hpp 77789717e 
> 
> Diff: https://reviews.apache.org/r/54996/diff/
> 
> 
> Testing
> -------
> 
> Built Mesos from source and am currently running it in a test cluster. Launched both Docker and Mesos tasks via Marathon without any resulting crash (initial crash only happened with Mesos containerizer + linux_launcher, not with the posix_launcher).
> 
> 
> Thanks,
> 
> Aaron Wood
> 
>


Re: Review Request 54996: Fix SIGBUS crash on ARM64/AArch64.

Posted by Aaron Wood <aa...@verizon.com>.

> On Jan. 4, 2017, 5:58 p.m., Jie Yu wrote:
> > 3rdparty/stout/include/stout/os/linux.hpp, lines 57-60
> > <https://reviews.apache.org/r/54996/diff/2/?file=1596476#file1596476line57>
> >
> >     I would actually suggest keep this `create` method, but move the allocation logic in `create`. 
> >     
> >     ```
> >     static Try<Stack> create(size_t size)
> >     {
> >       Stack stack(size);
> >       
> >       if (posix_memalign(...) != 0) {
> >         return ErrnoError("Failed to allocate stack");
> >       }
> >       
> >       return stack;
> >     }
> >     ```
> >     
> >     We can get rid of the `allocate` function. Once created, it's by default allocated.
> 
> Aaron Wood wrote:
>     `address` and `size` would need to be static for this. That opens up another issue since all stacks would share the same data. Would it be better to get rid of the private constructor, delete the copy constructor and assignment, and just have `allocate()` and `deallocate()`?
> 
> Jie Yu wrote:
>     Why? you can do `stack.address` and `stack.size`.  constructor is private, meaning that only 'create' can construct the Stack. Removing copy and assignment operator sounds good. Can you try that see if that compiles?

You're right, I was not thinking of using the local `stack` object directly.
Note that this only compiles with deleting the `=` operator since, as you had said, the copy constructor `Try(const U& u)` is invoked underneath.


- Aaron


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54996/#review160513
-----------------------------------------------------------


On Jan. 4, 2017, 9:28 p.m., Aaron Wood wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54996/
> -----------------------------------------------------------
> 
> (Updated Jan. 4, 2017, 9:28 p.m.)
> 
> 
> Review request for mesos and Jie Yu.
> 
> 
> Bugs: MESOS-6835
>     https://issues.apache.org/jira/browse/MESOS-6835
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently in the Linux launcher when the stack is allocated and prepared for a call to clone() it is not properly aligned. This is not an issue for x86 or x64 but for ARM64/AArch64 it is because of the requirement of having the stack aligned to a 16 byte boundary. While x86 and x64 also expect the stack to have a 16 byte aligned stack, it is not enforced. An explanation of the stack and requirements for ARM64 can be found here http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf (specifically section 5.2.2.1 that says SP mod 16 = 0. The stack must be quad-word aligned.)
> 
> Additionally, the way that the stack is currently allocated and passed to clone() accidentally chops off one entry, making a stack overflow using those missing 8 bytes a possibility. Fixing this while aligning the memory will fix both the issue of the stack overflow issue as well as the SIGBUS crash. We should also net better performance from having the stack aligned.
> 
> 
> Diffs
> -----
> 
>   3rdparty/stout/include/stout/os/linux.hpp 530f1a55b 
>   src/linux/ns.hpp 77789717e 
> 
> Diff: https://reviews.apache.org/r/54996/diff/
> 
> 
> Testing
> -------
> 
> Built Mesos from source and am currently running it in a test cluster. Launched both Docker and Mesos tasks via Marathon without any resulting crash (initial crash only happened with Mesos containerizer + linux_launcher, not with the posix_launcher).
> 
> 
> Thanks,
> 
> Aaron Wood
> 
>


Re: Review Request 54996: Fix SIGBUS crash on ARM64/AArch64.

Posted by Jie Yu <yu...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54996/#review160513
-----------------------------------------------------------




3rdparty/stout/include/stout/os/linux.hpp (lines 57 - 60)
<https://reviews.apache.org/r/54996/#comment231647>

    I would actually suggest keep this `create` method, but move the allocation logic in `create`. 
    
    ```
    static Try<Stack> create(size_t size)
    {
      Stack stack(size);
      
      if (posix_memalign(...) != 0) {
        return ErrnoError("Failed to allocate stack");
      }
      
      return stack;
    }
    ```
    
    We can get rid of the `allocate` function. Once created, it's by default allocated.



3rdparty/stout/include/stout/os/linux.hpp (line 58)
<https://reviews.apache.org/r/54996/#comment231648>

    Please refer to our style guide:
    https://github.com/apache/mesos/blob/master/docs/c%2B%2B-style-guide.md
    
    You should move the tailing `{` to the next line. Ditto everywhere else.



3rdparty/stout/include/stout/os/linux.hpp (lines 74 - 76)
<https://reviews.apache.org/r/54996/#comment231649>

    add a space after `if`



3rdparty/stout/include/stout/os/linux.hpp (line 75)
<https://reviews.apache.org/r/54996/#comment231650>

    +1 to James's suggestion.



3rdparty/stout/include/stout/os/linux.hpp (lines 102 - 111)
<https://reviews.apache.org/r/54996/#comment231655>

    I would simply the logics here:
    ```
    if (stack.isNone()) {
      Try<Stack> _stack = Stack::create(...);
      if (_stack.isError()) {
        return -1;
      }
      
      stack = _stack.get();
      cleanup = true;
    }
    
    char* address = stack->address + stack->size;
    
    ...
    
    if (cleanup && ...) {
      stack->deallocate();
    }
    ```



3rdparty/stout/include/stout/os/linux.hpp (line 118)
<https://reviews.apache.org/r/54996/#comment231651>

    Let's try to use `char*` consistently. Pointer math on void pointers is disallowed in both C and C++.



src/linux/ns.hpp (lines 432 - 439)
<https://reviews.apache.org/r/54996/#comment231656>

    ```
    Try<Stack> stack = os::Stack::create(...);
    if (stack.isError()) {
      return Error("Failed to allocate stack: " + stack.error());
    }
    
    ...
    ```


- Jie Yu


On Jan. 4, 2017, 12:26 a.m., Aaron Wood wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54996/
> -----------------------------------------------------------
> 
> (Updated Jan. 4, 2017, 12:26 a.m.)
> 
> 
> Review request for mesos and Jie Yu.
> 
> 
> Bugs: MESOS-6835
>     https://issues.apache.org/jira/browse/MESOS-6835
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently in the Linux launcher when the stack is allocated and prepared for a call to clone() it is not properly aligned. This is not an issue for x86 or x64 but for ARM64/AArch64 it is because of the requirement of having the stack aligned to a 16 byte boundary. While x86 and x64 also expect the stack to have a 16 byte aligned stack, it is not enforced. An explanation of the stack and requirements for ARM64 can be found here http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf (specifically section 5.2.2.1 that says SP mod 16 = 0. The stack must be quad-word aligned.)
> 
> Additionally, the way that the stack is currently allocated and passed to clone() accidentally chops off one entry, making a stack overflow using those missing 8 bytes a possibility. Fixing this while aligning the memory will fix both the issue of the stack overflow issue as well as the SIGBUS crash. We should also net better performance from having the stack aligned.
> 
> 
> Diffs
> -----
> 
>   3rdparty/stout/include/stout/os/linux.hpp 530f1a55b 
>   src/linux/ns.hpp 77789717e 
> 
> Diff: https://reviews.apache.org/r/54996/diff/
> 
> 
> Testing
> -------
> 
> Built Mesos from source and am currently running it in a test cluster. Launched both Docker and Mesos tasks via Marathon without any resulting crash (initial crash only happened with Mesos containerizer + linux_launcher, not with the posix_launcher).
> 
> 
> Thanks,
> 
> Aaron Wood
> 
>


Re: Review Request 54996: Fix SIGBUS crash on ARM64/AArch64.

Posted by Aaron Wood <aa...@verizon.com>.

> On Jan. 4, 2017, 5:41 a.m., James Peach wrote:
> > 3rdparty/stout/include/stout/os/linux.hpp, line 120
> > <https://reviews.apache.org/r/54996/diff/2/?file=1596476#file1596476line120>
> >
> >     Consider hoisting this into the `Stack` class:
> >     
> >     ```
> >     void * Stack::start() {
> >       return (uint8_t *)address + size;
> >     }
> >     ```
> 
> Aaron Wood wrote:
>     Don't we want to avoid C-style casts?
> 
> James Peach wrote:
>     Sure, you could `static_cast` here. Not much difference imho :)
> 
> Aaron Wood wrote:
>     True. I thought it was part of the Mesos style guide which is why I ask.
> 
> Aaron Wood wrote:
>     Dropping this to stick with `char*` instead.

Actually, let me mark this as fixed since I did add a `start()` method, it just doesn't return a `void*`.


- Aaron


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54996/#review160464
-----------------------------------------------------------


On Jan. 4, 2017, 9:28 p.m., Aaron Wood wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54996/
> -----------------------------------------------------------
> 
> (Updated Jan. 4, 2017, 9:28 p.m.)
> 
> 
> Review request for mesos and Jie Yu.
> 
> 
> Bugs: MESOS-6835
>     https://issues.apache.org/jira/browse/MESOS-6835
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently in the Linux launcher when the stack is allocated and prepared for a call to clone() it is not properly aligned. This is not an issue for x86 or x64 but for ARM64/AArch64 it is because of the requirement of having the stack aligned to a 16 byte boundary. While x86 and x64 also expect the stack to have a 16 byte aligned stack, it is not enforced. An explanation of the stack and requirements for ARM64 can be found here http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf (specifically section 5.2.2.1 that says SP mod 16 = 0. The stack must be quad-word aligned.)
> 
> Additionally, the way that the stack is currently allocated and passed to clone() accidentally chops off one entry, making a stack overflow using those missing 8 bytes a possibility. Fixing this while aligning the memory will fix both the issue of the stack overflow issue as well as the SIGBUS crash. We should also net better performance from having the stack aligned.
> 
> 
> Diffs
> -----
> 
>   3rdparty/stout/include/stout/os/linux.hpp 530f1a55b 
>   src/linux/ns.hpp 77789717e 
> 
> Diff: https://reviews.apache.org/r/54996/diff/
> 
> 
> Testing
> -------
> 
> Built Mesos from source and am currently running it in a test cluster. Launched both Docker and Mesos tasks via Marathon without any resulting crash (initial crash only happened with Mesos containerizer + linux_launcher, not with the posix_launcher).
> 
> 
> Thanks,
> 
> Aaron Wood
> 
>


Re: Review Request 54996: Fix SIGBUS crash on ARM64/AArch64.

Posted by Aaron Wood <aa...@verizon.com>.

> On Jan. 4, 2017, 5:41 a.m., James Peach wrote:
> > 3rdparty/stout/include/stout/os/linux.hpp, line 120
> > <https://reviews.apache.org/r/54996/diff/2/?file=1596476#file1596476line120>
> >
> >     Consider hoisting this into the `Stack` class:
> >     
> >     ```
> >     void * Stack::start() {
> >       return (uint8_t *)address + size;
> >     }
> >     ```
> 
> Aaron Wood wrote:
>     Don't we want to avoid C-style casts?
> 
> James Peach wrote:
>     Sure, you could `static_cast` here. Not much difference imho :)
> 
> Aaron Wood wrote:
>     True. I thought it was part of the Mesos style guide which is why I ask.

Dropping this to stick with `char*` instead.


- Aaron


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54996/#review160464
-----------------------------------------------------------


On Jan. 4, 2017, 9:28 p.m., Aaron Wood wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54996/
> -----------------------------------------------------------
> 
> (Updated Jan. 4, 2017, 9:28 p.m.)
> 
> 
> Review request for mesos and Jie Yu.
> 
> 
> Bugs: MESOS-6835
>     https://issues.apache.org/jira/browse/MESOS-6835
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently in the Linux launcher when the stack is allocated and prepared for a call to clone() it is not properly aligned. This is not an issue for x86 or x64 but for ARM64/AArch64 it is because of the requirement of having the stack aligned to a 16 byte boundary. While x86 and x64 also expect the stack to have a 16 byte aligned stack, it is not enforced. An explanation of the stack and requirements for ARM64 can be found here http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf (specifically section 5.2.2.1 that says SP mod 16 = 0. The stack must be quad-word aligned.)
> 
> Additionally, the way that the stack is currently allocated and passed to clone() accidentally chops off one entry, making a stack overflow using those missing 8 bytes a possibility. Fixing this while aligning the memory will fix both the issue of the stack overflow issue as well as the SIGBUS crash. We should also net better performance from having the stack aligned.
> 
> 
> Diffs
> -----
> 
>   3rdparty/stout/include/stout/os/linux.hpp 530f1a55b 
>   src/linux/ns.hpp 77789717e 
> 
> Diff: https://reviews.apache.org/r/54996/diff/
> 
> 
> Testing
> -------
> 
> Built Mesos from source and am currently running it in a test cluster. Launched both Docker and Mesos tasks via Marathon without any resulting crash (initial crash only happened with Mesos containerizer + linux_launcher, not with the posix_launcher).
> 
> 
> Thanks,
> 
> Aaron Wood
> 
>


Re: Review Request 54996: Fix SIGBUS crash on ARM64/AArch64.

Posted by Aaron Wood <aa...@verizon.com>.

> On Jan. 4, 2017, 5:41 a.m., James Peach wrote:
> > 3rdparty/stout/include/stout/os/linux.hpp, line 120
> > <https://reviews.apache.org/r/54996/diff/2/?file=1596476#file1596476line120>
> >
> >     Consider hoisting this into the `Stack` class:
> >     
> >     ```
> >     void * Stack::start() {
> >       return (uint8_t *)address + size;
> >     }
> >     ```
> 
> Aaron Wood wrote:
>     Don't we want to avoid C-style casts?
> 
> James Peach wrote:
>     Sure, you could `static_cast` here. Not much difference imho :)

True. I thought it was part of the Mesos style guide which is why I ask.


- Aaron


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54996/#review160464
-----------------------------------------------------------


On Jan. 4, 2017, 12:26 a.m., Aaron Wood wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54996/
> -----------------------------------------------------------
> 
> (Updated Jan. 4, 2017, 12:26 a.m.)
> 
> 
> Review request for mesos and Jie Yu.
> 
> 
> Bugs: MESOS-6835
>     https://issues.apache.org/jira/browse/MESOS-6835
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently in the Linux launcher when the stack is allocated and prepared for a call to clone() it is not properly aligned. This is not an issue for x86 or x64 but for ARM64/AArch64 it is because of the requirement of having the stack aligned to a 16 byte boundary. While x86 and x64 also expect the stack to have a 16 byte aligned stack, it is not enforced. An explanation of the stack and requirements for ARM64 can be found here http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf (specifically section 5.2.2.1 that says SP mod 16 = 0. The stack must be quad-word aligned.)
> 
> Additionally, the way that the stack is currently allocated and passed to clone() accidentally chops off one entry, making a stack overflow using those missing 8 bytes a possibility. Fixing this while aligning the memory will fix both the issue of the stack overflow issue as well as the SIGBUS crash. We should also net better performance from having the stack aligned.
> 
> 
> Diffs
> -----
> 
>   3rdparty/stout/include/stout/os/linux.hpp 530f1a55b 
>   src/linux/ns.hpp 77789717e 
> 
> Diff: https://reviews.apache.org/r/54996/diff/
> 
> 
> Testing
> -------
> 
> Built Mesos from source and am currently running it in a test cluster. Launched both Docker and Mesos tasks via Marathon without any resulting crash (initial crash only happened with Mesos containerizer + linux_launcher, not with the posix_launcher).
> 
> 
> Thanks,
> 
> Aaron Wood
> 
>


Re: Review Request 54996: Fix SIGBUS crash on ARM64/AArch64.

Posted by Aaron Wood <aa...@verizon.com>.

> On Jan. 4, 2017, 5:41 a.m., James Peach wrote:
> > 3rdparty/stout/include/stout/os/linux.hpp, line 120
> > <https://reviews.apache.org/r/54996/diff/2/?file=1596476#file1596476line120>
> >
> >     Consider hoisting this into the `Stack` class:
> >     
> >     ```
> >     void * Stack::start() {
> >       return (uint8_t *)address + size;
> >     }
> >     ```

Don't we want to avoid C-style casts?


- Aaron


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54996/#review160464
-----------------------------------------------------------


On Jan. 4, 2017, 12:26 a.m., Aaron Wood wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54996/
> -----------------------------------------------------------
> 
> (Updated Jan. 4, 2017, 12:26 a.m.)
> 
> 
> Review request for mesos and Jie Yu.
> 
> 
> Bugs: MESOS-6835
>     https://issues.apache.org/jira/browse/MESOS-6835
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently in the Linux launcher when the stack is allocated and prepared for a call to clone() it is not properly aligned. This is not an issue for x86 or x64 but for ARM64/AArch64 it is because of the requirement of having the stack aligned to a 16 byte boundary. While x86 and x64 also expect the stack to have a 16 byte aligned stack, it is not enforced. An explanation of the stack and requirements for ARM64 can be found here http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf (specifically section 5.2.2.1 that says SP mod 16 = 0. The stack must be quad-word aligned.)
> 
> Additionally, the way that the stack is currently allocated and passed to clone() accidentally chops off one entry, making a stack overflow using those missing 8 bytes a possibility. Fixing this while aligning the memory will fix both the issue of the stack overflow issue as well as the SIGBUS crash. We should also net better performance from having the stack aligned.
> 
> 
> Diffs
> -----
> 
>   3rdparty/stout/include/stout/os/linux.hpp 530f1a55b 
>   src/linux/ns.hpp 77789717e 
> 
> Diff: https://reviews.apache.org/r/54996/diff/
> 
> 
> Testing
> -------
> 
> Built Mesos from source and am currently running it in a test cluster. Launched both Docker and Mesos tasks via Marathon without any resulting crash (initial crash only happened with Mesos containerizer + linux_launcher, not with the posix_launcher).
> 
> 
> Thanks,
> 
> Aaron Wood
> 
>


Re: Review Request 54996: Fix SIGBUS crash on ARM64/AArch64.

Posted by Aaron Wood <aa...@verizon.com>.

> On Jan. 4, 2017, 5:41 a.m., James Peach wrote:
> > 3rdparty/stout/include/stout/os/linux.hpp, line 63
> > <https://reviews.apache.org/r/54996/diff/2/?file=1596476#file1596476line63>
> >
> >     I'd recommend something like this:
> >     ```
> >     Try<Nothing> allocate(size_t nbytes) {
> >         int error = ::posix_memalign(&address, os::getpagesize(), nbytes);
> >         if (error) {
> >             return ErrnoError(error);
> >         }
> >         
> >         return Nothing();
> >     }
> >     ```

While my changes are not 100% aligned with what you suggested here, I did move to using `getpagesize()`.


> On Jan. 4, 2017, 5:41 a.m., James Peach wrote:
> > 3rdparty/stout/include/stout/os/linux.hpp, line 60
> > <https://reviews.apache.org/r/54996/diff/2/?file=1596476#file1596476line60>
> >
> >     I don't think that you need this static constructor. You can do everything you need in `allocate`, which is them symmetric with `deallocate`.

Dropping since Jie wants to keep the private constructor.


- Aaron


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54996/#review160464
-----------------------------------------------------------


On Jan. 4, 2017, 9:28 p.m., Aaron Wood wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54996/
> -----------------------------------------------------------
> 
> (Updated Jan. 4, 2017, 9:28 p.m.)
> 
> 
> Review request for mesos and Jie Yu.
> 
> 
> Bugs: MESOS-6835
>     https://issues.apache.org/jira/browse/MESOS-6835
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently in the Linux launcher when the stack is allocated and prepared for a call to clone() it is not properly aligned. This is not an issue for x86 or x64 but for ARM64/AArch64 it is because of the requirement of having the stack aligned to a 16 byte boundary. While x86 and x64 also expect the stack to have a 16 byte aligned stack, it is not enforced. An explanation of the stack and requirements for ARM64 can be found here http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf (specifically section 5.2.2.1 that says SP mod 16 = 0. The stack must be quad-word aligned.)
> 
> Additionally, the way that the stack is currently allocated and passed to clone() accidentally chops off one entry, making a stack overflow using those missing 8 bytes a possibility. Fixing this while aligning the memory will fix both the issue of the stack overflow issue as well as the SIGBUS crash. We should also net better performance from having the stack aligned.
> 
> 
> Diffs
> -----
> 
>   3rdparty/stout/include/stout/os/linux.hpp 530f1a55b 
>   src/linux/ns.hpp 77789717e 
> 
> Diff: https://reviews.apache.org/r/54996/diff/
> 
> 
> Testing
> -------
> 
> Built Mesos from source and am currently running it in a test cluster. Launched both Docker and Mesos tasks via Marathon without any resulting crash (initial crash only happened with Mesos containerizer + linux_launcher, not with the posix_launcher).
> 
> 
> Thanks,
> 
> Aaron Wood
> 
>


Re: Review Request 54996: Fix SIGBUS crash on ARM64/AArch64.

Posted by Aaron Wood <aa...@verizon.com>.

> On Jan. 4, 2017, 5:41 a.m., James Peach wrote:
> > 3rdparty/stout/include/stout/os/linux.hpp, line 82
> > <https://reviews.apache.org/r/54996/diff/2/?file=1596476#file1596476line82>
> >
> >     You should explictly delete the copy constructors here to ensure the stack can't be leaked if the code changes:
> >     
> >     ```
> >     Stack(const Stack&) = delete;
> >     Stack& operator=(const Stack&) = delete;
> >     ```

See comment below about why the copy constructor cannot be deleted in this case.


- Aaron


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54996/#review160464
-----------------------------------------------------------


On Jan. 4, 2017, 9:28 p.m., Aaron Wood wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54996/
> -----------------------------------------------------------
> 
> (Updated Jan. 4, 2017, 9:28 p.m.)
> 
> 
> Review request for mesos and Jie Yu.
> 
> 
> Bugs: MESOS-6835
>     https://issues.apache.org/jira/browse/MESOS-6835
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently in the Linux launcher when the stack is allocated and prepared for a call to clone() it is not properly aligned. This is not an issue for x86 or x64 but for ARM64/AArch64 it is because of the requirement of having the stack aligned to a 16 byte boundary. While x86 and x64 also expect the stack to have a 16 byte aligned stack, it is not enforced. An explanation of the stack and requirements for ARM64 can be found here http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf (specifically section 5.2.2.1 that says SP mod 16 = 0. The stack must be quad-word aligned.)
> 
> Additionally, the way that the stack is currently allocated and passed to clone() accidentally chops off one entry, making a stack overflow using those missing 8 bytes a possibility. Fixing this while aligning the memory will fix both the issue of the stack overflow issue as well as the SIGBUS crash. We should also net better performance from having the stack aligned.
> 
> 
> Diffs
> -----
> 
>   3rdparty/stout/include/stout/os/linux.hpp 530f1a55b 
>   src/linux/ns.hpp 77789717e 
> 
> Diff: https://reviews.apache.org/r/54996/diff/
> 
> 
> Testing
> -------
> 
> Built Mesos from source and am currently running it in a test cluster. Launched both Docker and Mesos tasks via Marathon without any resulting crash (initial crash only happened with Mesos containerizer + linux_launcher, not with the posix_launcher).
> 
> 
> Thanks,
> 
> Aaron Wood
> 
>


Re: Review Request 54996: Fix SIGBUS crash on ARM64/AArch64.

Posted by James Peach <jp...@apache.org>.

> On Jan. 4, 2017, 5:41 a.m., James Peach wrote:
> > 3rdparty/stout/include/stout/os/linux.hpp, line 120
> > <https://reviews.apache.org/r/54996/diff/2/?file=1596476#file1596476line120>
> >
> >     Consider hoisting this into the `Stack` class:
> >     
> >     ```
> >     void * Stack::start() {
> >       return (uint8_t *)address + size;
> >     }
> >     ```
> 
> Aaron Wood wrote:
>     Don't we want to avoid C-style casts?

Sure, you could `static_cast` here. Not much difference imho :)


- James


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54996/#review160464
-----------------------------------------------------------


On Jan. 4, 2017, 12:26 a.m., Aaron Wood wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54996/
> -----------------------------------------------------------
> 
> (Updated Jan. 4, 2017, 12:26 a.m.)
> 
> 
> Review request for mesos and Jie Yu.
> 
> 
> Bugs: MESOS-6835
>     https://issues.apache.org/jira/browse/MESOS-6835
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently in the Linux launcher when the stack is allocated and prepared for a call to clone() it is not properly aligned. This is not an issue for x86 or x64 but for ARM64/AArch64 it is because of the requirement of having the stack aligned to a 16 byte boundary. While x86 and x64 also expect the stack to have a 16 byte aligned stack, it is not enforced. An explanation of the stack and requirements for ARM64 can be found here http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf (specifically section 5.2.2.1 that says SP mod 16 = 0. The stack must be quad-word aligned.)
> 
> Additionally, the way that the stack is currently allocated and passed to clone() accidentally chops off one entry, making a stack overflow using those missing 8 bytes a possibility. Fixing this while aligning the memory will fix both the issue of the stack overflow issue as well as the SIGBUS crash. We should also net better performance from having the stack aligned.
> 
> 
> Diffs
> -----
> 
>   3rdparty/stout/include/stout/os/linux.hpp 530f1a55b 
>   src/linux/ns.hpp 77789717e 
> 
> Diff: https://reviews.apache.org/r/54996/diff/
> 
> 
> Testing
> -------
> 
> Built Mesos from source and am currently running it in a test cluster. Launched both Docker and Mesos tasks via Marathon without any resulting crash (initial crash only happened with Mesos containerizer + linux_launcher, not with the posix_launcher).
> 
> 
> Thanks,
> 
> Aaron Wood
> 
>


Re: Review Request 54996: Fix SIGBUS crash on ARM64/AArch64.

Posted by James Peach <jp...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54996/#review160464
-----------------------------------------------------------




3rdparty/stout/include/stout/os/linux.hpp (line 60)
<https://reviews.apache.org/r/54996/#comment231621>

    I don't think that you need this static constructor. You can do everything you need in `allocate`, which is them symmetric with `deallocate`.



3rdparty/stout/include/stout/os/linux.hpp (line 63)
<https://reviews.apache.org/r/54996/#comment231622>

    I'd recommend something like this:
    ```
    Try<Nothing> allocate(size_t nbytes) {
        int error = ::posix_memalign(&address, os::getpagesize(), nbytes);
        if (error) {
            return ErrnoError(error);
        }
        
        return Nothing();
    }
    ```



3rdparty/stout/include/stout/os/linux.hpp (line 75)
<https://reviews.apache.org/r/54996/#comment231625>

    You can simplify this to just:
    ```
    void deallocate() {
      ::free(address);
      
      address = nullptr;
      size = 0;
    }
    ```



3rdparty/stout/include/stout/os/linux.hpp (line 79)
<https://reviews.apache.org/r/54996/#comment231624>

    You can avoid the `reinterpret_cast` by typing this as `void *`. If you add the `start()` method suggested below, you can also make it private.



3rdparty/stout/include/stout/os/linux.hpp (line 82)
<https://reviews.apache.org/r/54996/#comment231620>

    You should explictly delete the copy constructors here to ensure the stack can't be leaked if the code changes:
    
    ```
    Stack(const Stack&) = delete;
    Stack& operator=(const Stack&) = delete;
    ```



3rdparty/stout/include/stout/os/linux.hpp (line 105)
<https://reviews.apache.org/r/54996/#comment231626>

    Since this magic number appears twice I'd be inclined to hoist it into `Stack`:
    ```
    class Stack {
      ...
      static constexpr size_t DefaultSize = 8 * 1024 * 1024;
      ...
    };
    ```



3rdparty/stout/include/stout/os/linux.hpp (line 118)
<https://reviews.apache.org/r/54996/#comment231623>

    Consider hoisting this into the `Stack` class:
    
    ```
    void * Stack::start() {
      return (uint8_t *)address + size;
    }
    ```


- James Peach


On Jan. 4, 2017, 12:26 a.m., Aaron Wood wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54996/
> -----------------------------------------------------------
> 
> (Updated Jan. 4, 2017, 12:26 a.m.)
> 
> 
> Review request for mesos and Jie Yu.
> 
> 
> Bugs: MESOS-6835
>     https://issues.apache.org/jira/browse/MESOS-6835
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently in the Linux launcher when the stack is allocated and prepared for a call to clone() it is not properly aligned. This is not an issue for x86 or x64 but for ARM64/AArch64 it is because of the requirement of having the stack aligned to a 16 byte boundary. While x86 and x64 also expect the stack to have a 16 byte aligned stack, it is not enforced. An explanation of the stack and requirements for ARM64 can be found here http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf (specifically section 5.2.2.1 that says SP mod 16 = 0. The stack must be quad-word aligned.)
> 
> Additionally, the way that the stack is currently allocated and passed to clone() accidentally chops off one entry, making a stack overflow using those missing 8 bytes a possibility. Fixing this while aligning the memory will fix both the issue of the stack overflow issue as well as the SIGBUS crash. We should also net better performance from having the stack aligned.
> 
> 
> Diffs
> -----
> 
>   3rdparty/stout/include/stout/os/linux.hpp 530f1a55b 
>   src/linux/ns.hpp 77789717e 
> 
> Diff: https://reviews.apache.org/r/54996/diff/
> 
> 
> Testing
> -------
> 
> Built Mesos from source and am currently running it in a test cluster. Launched both Docker and Mesos tasks via Marathon without any resulting crash (initial crash only happened with Mesos containerizer + linux_launcher, not with the posix_launcher).
> 
> 
> Thanks,
> 
> Aaron Wood
> 
>


Re: Review Request 54996: Fix SIGBUS crash on ARM64/AArch64.

Posted by Aaron Wood <aa...@verizon.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54996/#review160541
-----------------------------------------------------------




3rdparty/stout/include/stout/os/linux.hpp (line 72)
<https://reviews.apache.org/r/54996/#comment231691>

    Looks like `posix_memalign` never sets `errno`. Need to change this to return `Error`.


- Aaron Wood


On Jan. 4, 2017, 9:28 p.m., Aaron Wood wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54996/
> -----------------------------------------------------------
> 
> (Updated Jan. 4, 2017, 9:28 p.m.)
> 
> 
> Review request for mesos and Jie Yu.
> 
> 
> Bugs: MESOS-6835
>     https://issues.apache.org/jira/browse/MESOS-6835
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently in the Linux launcher when the stack is allocated and prepared for a call to clone() it is not properly aligned. This is not an issue for x86 or x64 but for ARM64/AArch64 it is because of the requirement of having the stack aligned to a 16 byte boundary. While x86 and x64 also expect the stack to have a 16 byte aligned stack, it is not enforced. An explanation of the stack and requirements for ARM64 can be found here http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf (specifically section 5.2.2.1 that says SP mod 16 = 0. The stack must be quad-word aligned.)
> 
> Additionally, the way that the stack is currently allocated and passed to clone() accidentally chops off one entry, making a stack overflow using those missing 8 bytes a possibility. Fixing this while aligning the memory will fix both the issue of the stack overflow issue as well as the SIGBUS crash. We should also net better performance from having the stack aligned.
> 
> 
> Diffs
> -----
> 
>   3rdparty/stout/include/stout/os/linux.hpp 530f1a55b 
>   src/linux/ns.hpp 77789717e 
> 
> Diff: https://reviews.apache.org/r/54996/diff/
> 
> 
> Testing
> -------
> 
> Built Mesos from source and am currently running it in a test cluster. Launched both Docker and Mesos tasks via Marathon without any resulting crash (initial crash only happened with Mesos containerizer + linux_launcher, not with the posix_launcher).
> 
> 
> Thanks,
> 
> Aaron Wood
> 
>


Re: Review Request 54996: Fix SIGBUS crash on ARM64/AArch64.

Posted by Mesos ReviewBot <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54996/#review160937
-----------------------------------------------------------



Patch looks great!

Reviews applied: [54996]

Passed command: export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' CONFIGURATION='--verbose' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker_build.sh

- Mesos ReviewBot


On Jan. 9, 2017, 4:42 p.m., Aaron Wood wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54996/
> -----------------------------------------------------------
> 
> (Updated Jan. 9, 2017, 4:42 p.m.)
> 
> 
> Review request for mesos and Jie Yu.
> 
> 
> Bugs: MESOS-6835
>     https://issues.apache.org/jira/browse/MESOS-6835
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently in the Linux launcher when the stack is allocated and prepared for a call to clone() it is not properly aligned. This is not an issue for x86 or x64 but for ARM64/AArch64 it is because of the requirement of having the stack aligned to a 16 byte boundary. While x86 and x64 also expect the stack to have a 16 byte aligned stack, it is not enforced. An explanation of the stack and requirements for ARM64 can be found here http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf (specifically section 5.2.2.1 that says SP mod 16 = 0. The stack must be quad-word aligned.)
> 
> Additionally, the way that the stack is currently allocated and passed to clone() accidentally chops off one entry, making a stack overflow using those missing 8 bytes a possibility. Fixing this while aligning the memory will fix both the issue of the stack overflow issue as well as the SIGBUS crash. We should also net better performance from having the stack aligned.
> 
> 
> Diffs
> -----
> 
>   3rdparty/stout/include/stout/os/linux.hpp 530f1a55b 
>   src/linux/ns.hpp 77789717e 
> 
> Diff: https://reviews.apache.org/r/54996/diff/
> 
> 
> Testing
> -------
> 
> Built Mesos from source and am currently running it in a test cluster. Launched both Docker and Mesos tasks via Marathon without any resulting crash (initial crash only happened with Mesos containerizer + linux_launcher, not with the posix_launcher).
> 
> `make check -j4` also passes with no errors.
> 
> 
> Thanks,
> 
> Aaron Wood
> 
>


Re: Review Request 54996: Fix SIGBUS crash on ARM64/AArch64.

Posted by Aaron Wood <aa...@verizon.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54996/
-----------------------------------------------------------

(Updated Jan. 9, 2017, 4:42 p.m.)


Review request for mesos and Jie Yu.


Bugs: MESOS-6835
    https://issues.apache.org/jira/browse/MESOS-6835


Repository: mesos


Description
-------

Currently in the Linux launcher when the stack is allocated and prepared for a call to clone() it is not properly aligned. This is not an issue for x86 or x64 but for ARM64/AArch64 it is because of the requirement of having the stack aligned to a 16 byte boundary. While x86 and x64 also expect the stack to have a 16 byte aligned stack, it is not enforced. An explanation of the stack and requirements for ARM64 can be found here http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf (specifically section 5.2.2.1 that says SP mod 16 = 0. The stack must be quad-word aligned.)

Additionally, the way that the stack is currently allocated and passed to clone() accidentally chops off one entry, making a stack overflow using those missing 8 bytes a possibility. Fixing this while aligning the memory will fix both the issue of the stack overflow issue as well as the SIGBUS crash. We should also net better performance from having the stack aligned.


Diffs (updated)
-----

  3rdparty/stout/include/stout/os/linux.hpp 530f1a55b 
  src/linux/ns.hpp 77789717e 

Diff: https://reviews.apache.org/r/54996/diff/


Testing
-------

Built Mesos from source and am currently running it in a test cluster. Launched both Docker and Mesos tasks via Marathon without any resulting crash (initial crash only happened with Mesos containerizer + linux_launcher, not with the posix_launcher).

`make check -j4` also passes with no errors.


Thanks,

Aaron Wood


Re: Review Request 54996: Fix SIGBUS crash on ARM64/AArch64.

Posted by Aaron Wood <aa...@verizon.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54996/
-----------------------------------------------------------

(Updated Jan. 9, 2017, 4:19 p.m.)


Review request for mesos and Jie Yu.


Bugs: MESOS-6835
    https://issues.apache.org/jira/browse/MESOS-6835


Repository: mesos


Description
-------

Currently in the Linux launcher when the stack is allocated and prepared for a call to clone() it is not properly aligned. This is not an issue for x86 or x64 but for ARM64/AArch64 it is because of the requirement of having the stack aligned to a 16 byte boundary. While x86 and x64 also expect the stack to have a 16 byte aligned stack, it is not enforced. An explanation of the stack and requirements for ARM64 can be found here http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf (specifically section 5.2.2.1 that says SP mod 16 = 0. The stack must be quad-word aligned.)

Additionally, the way that the stack is currently allocated and passed to clone() accidentally chops off one entry, making a stack overflow using those missing 8 bytes a possibility. Fixing this while aligning the memory will fix both the issue of the stack overflow issue as well as the SIGBUS crash. We should also net better performance from having the stack aligned.


Diffs
-----

  3rdparty/stout/include/stout/os/linux.hpp 530f1a55b 
  src/linux/ns.hpp 77789717e 

Diff: https://reviews.apache.org/r/54996/diff/


Testing (updated)
-------

Built Mesos from source and am currently running it in a test cluster. Launched both Docker and Mesos tasks via Marathon without any resulting crash (initial crash only happened with Mesos containerizer + linux_launcher, not with the posix_launcher).

`make check -j4` also passes with no errors.


Thanks,

Aaron Wood


Re: Review Request 54996: Fix SIGBUS crash on ARM64/AArch64.

Posted by Mesos ReviewBot <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54996/#review160652
-----------------------------------------------------------



Patch looks great!

Reviews applied: [54996]

Passed command: export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' CONFIGURATION='--verbose' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker_build.sh

- Mesos ReviewBot


On Jan. 4, 2017, 9:55 p.m., Aaron Wood wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54996/
> -----------------------------------------------------------
> 
> (Updated Jan. 4, 2017, 9:55 p.m.)
> 
> 
> Review request for mesos and Jie Yu.
> 
> 
> Bugs: MESOS-6835
>     https://issues.apache.org/jira/browse/MESOS-6835
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently in the Linux launcher when the stack is allocated and prepared for a call to clone() it is not properly aligned. This is not an issue for x86 or x64 but for ARM64/AArch64 it is because of the requirement of having the stack aligned to a 16 byte boundary. While x86 and x64 also expect the stack to have a 16 byte aligned stack, it is not enforced. An explanation of the stack and requirements for ARM64 can be found here http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf (specifically section 5.2.2.1 that says SP mod 16 = 0. The stack must be quad-word aligned.)
> 
> Additionally, the way that the stack is currently allocated and passed to clone() accidentally chops off one entry, making a stack overflow using those missing 8 bytes a possibility. Fixing this while aligning the memory will fix both the issue of the stack overflow issue as well as the SIGBUS crash. We should also net better performance from having the stack aligned.
> 
> 
> Diffs
> -----
> 
>   3rdparty/stout/include/stout/os/linux.hpp 530f1a55b 
>   src/linux/ns.hpp 77789717e 
> 
> Diff: https://reviews.apache.org/r/54996/diff/
> 
> 
> Testing
> -------
> 
> Built Mesos from source and am currently running it in a test cluster. Launched both Docker and Mesos tasks via Marathon without any resulting crash (initial crash only happened with Mesos containerizer + linux_launcher, not with the posix_launcher).
> 
> 
> Thanks,
> 
> Aaron Wood
> 
>


Re: Review Request 54996: Fix SIGBUS crash on ARM64/AArch64.

Posted by Aaron Wood <aa...@verizon.com>.

> On Jan. 7, 2017, 6:42 p.m., Jie Yu wrote:
> > 3rdparty/stout/include/stout/os/linux.hpp, line 72
> > <https://reviews.apache.org/r/54996/diff/4/?file=1596764#file1596764line72>
> >
> >     Please use errno error so that errno are in the error string.
> >     
> >     `return ErrnoError("Memory allocation failed");`

`posix_memalign` never sets `errno`. Shouldn't we just return `Error`?


- Aaron


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54996/#review160821
-----------------------------------------------------------


On Jan. 4, 2017, 9:55 p.m., Aaron Wood wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54996/
> -----------------------------------------------------------
> 
> (Updated Jan. 4, 2017, 9:55 p.m.)
> 
> 
> Review request for mesos and Jie Yu.
> 
> 
> Bugs: MESOS-6835
>     https://issues.apache.org/jira/browse/MESOS-6835
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently in the Linux launcher when the stack is allocated and prepared for a call to clone() it is not properly aligned. This is not an issue for x86 or x64 but for ARM64/AArch64 it is because of the requirement of having the stack aligned to a 16 byte boundary. While x86 and x64 also expect the stack to have a 16 byte aligned stack, it is not enforced. An explanation of the stack and requirements for ARM64 can be found here http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf (specifically section 5.2.2.1 that says SP mod 16 = 0. The stack must be quad-word aligned.)
> 
> Additionally, the way that the stack is currently allocated and passed to clone() accidentally chops off one entry, making a stack overflow using those missing 8 bytes a possibility. Fixing this while aligning the memory will fix both the issue of the stack overflow issue as well as the SIGBUS crash. We should also net better performance from having the stack aligned.
> 
> 
> Diffs
> -----
> 
>   3rdparty/stout/include/stout/os/linux.hpp 530f1a55b 
>   src/linux/ns.hpp 77789717e 
> 
> Diff: https://reviews.apache.org/r/54996/diff/
> 
> 
> Testing
> -------
> 
> Built Mesos from source and am currently running it in a test cluster. Launched both Docker and Mesos tasks via Marathon without any resulting crash (initial crash only happened with Mesos containerizer + linux_launcher, not with the posix_launcher).
> 
> 
> Thanks,
> 
> Aaron Wood
> 
>


Re: Review Request 54996: Fix SIGBUS crash on ARM64/AArch64.

Posted by James Peach <jp...@apache.org>.

> On Jan. 7, 2017, 6:42 p.m., Jie Yu wrote:
> > 3rdparty/stout/include/stout/os/linux.hpp, line 72
> > <https://reviews.apache.org/r/54996/diff/4/?file=1596764#file1596764line72>
> >
> >     Please use errno error so that errno are in the error string.
> >     
> >     `return ErrnoError("Memory allocation failed");`
> 
> Aaron Wood wrote:
>     `posix_memalign` never sets `errno`. Shouldn't we just return `Error`?

You can do this:
```
int err = posix_memalign(...);
if (err) {
  return ErrnoError(err);
}
```


- James


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54996/#review160821
-----------------------------------------------------------


On Jan. 9, 2017, 4:19 p.m., Aaron Wood wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54996/
> -----------------------------------------------------------
> 
> (Updated Jan. 9, 2017, 4:19 p.m.)
> 
> 
> Review request for mesos and Jie Yu.
> 
> 
> Bugs: MESOS-6835
>     https://issues.apache.org/jira/browse/MESOS-6835
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently in the Linux launcher when the stack is allocated and prepared for a call to clone() it is not properly aligned. This is not an issue for x86 or x64 but for ARM64/AArch64 it is because of the requirement of having the stack aligned to a 16 byte boundary. While x86 and x64 also expect the stack to have a 16 byte aligned stack, it is not enforced. An explanation of the stack and requirements for ARM64 can be found here http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf (specifically section 5.2.2.1 that says SP mod 16 = 0. The stack must be quad-word aligned.)
> 
> Additionally, the way that the stack is currently allocated and passed to clone() accidentally chops off one entry, making a stack overflow using those missing 8 bytes a possibility. Fixing this while aligning the memory will fix both the issue of the stack overflow issue as well as the SIGBUS crash. We should also net better performance from having the stack aligned.
> 
> 
> Diffs
> -----
> 
>   3rdparty/stout/include/stout/os/linux.hpp 530f1a55b 
>   src/linux/ns.hpp 77789717e 
> 
> Diff: https://reviews.apache.org/r/54996/diff/
> 
> 
> Testing
> -------
> 
> Built Mesos from source and am currently running it in a test cluster. Launched both Docker and Mesos tasks via Marathon without any resulting crash (initial crash only happened with Mesos containerizer + linux_launcher, not with the posix_launcher).
> 
> `make check -j4` also passes with no errors.
> 
> 
> Thanks,
> 
> Aaron Wood
> 
>


Re: Review Request 54996: Fix SIGBUS crash on ARM64/AArch64.

Posted by Jie Yu <yu...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54996/#review160821
-----------------------------------------------------------


Fix it, then Ship it!





3rdparty/stout/include/stout/os/linux.hpp (line 58)
<https://reviews.apache.org/r/54996/#comment232048>

    s/DefaultSize/DEFAULT_SIZE/



3rdparty/stout/include/stout/os/linux.hpp (line 60)
<https://reviews.apache.org/r/54996/#comment232047>

    It's weird that we allow copy constructor but disallow assignment operator. Let's just remove this.



3rdparty/stout/include/stout/os/linux.hpp (line 66)
<https://reviews.apache.org/r/54996/#comment232046>

    I typically prefer one extra blank line before this comment



3rdparty/stout/include/stout/os/linux.hpp (lines 70 - 71)
<https://reviews.apache.org/r/54996/#comment232044>

    I would prefer align the code like the following:
    ```
    if (::posix_memalign(
        reinterpret_cast<void**>(&stack.address),
        os::pagesize(),
        stack.size) != 0) {
      return ErrnoError("Memory allocation failed");
    }
    ```



3rdparty/stout/include/stout/os/linux.hpp (line 71)
<https://reviews.apache.org/r/54996/#comment232043>

    please use os::pagesize() here.



3rdparty/stout/include/stout/os/linux.hpp (line 72)
<https://reviews.apache.org/r/54996/#comment232045>

    Please use errno error so that errno are in the error string.
    
    `return ErrnoError("Memory allocation failed");`


- Jie Yu


On Jan. 4, 2017, 9:55 p.m., Aaron Wood wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54996/
> -----------------------------------------------------------
> 
> (Updated Jan. 4, 2017, 9:55 p.m.)
> 
> 
> Review request for mesos and Jie Yu.
> 
> 
> Bugs: MESOS-6835
>     https://issues.apache.org/jira/browse/MESOS-6835
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently in the Linux launcher when the stack is allocated and prepared for a call to clone() it is not properly aligned. This is not an issue for x86 or x64 but for ARM64/AArch64 it is because of the requirement of having the stack aligned to a 16 byte boundary. While x86 and x64 also expect the stack to have a 16 byte aligned stack, it is not enforced. An explanation of the stack and requirements for ARM64 can be found here http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf (specifically section 5.2.2.1 that says SP mod 16 = 0. The stack must be quad-word aligned.)
> 
> Additionally, the way that the stack is currently allocated and passed to clone() accidentally chops off one entry, making a stack overflow using those missing 8 bytes a possibility. Fixing this while aligning the memory will fix both the issue of the stack overflow issue as well as the SIGBUS crash. We should also net better performance from having the stack aligned.
> 
> 
> Diffs
> -----
> 
>   3rdparty/stout/include/stout/os/linux.hpp 530f1a55b 
>   src/linux/ns.hpp 77789717e 
> 
> Diff: https://reviews.apache.org/r/54996/diff/
> 
> 
> Testing
> -------
> 
> Built Mesos from source and am currently running it in a test cluster. Launched both Docker and Mesos tasks via Marathon without any resulting crash (initial crash only happened with Mesos containerizer + linux_launcher, not with the posix_launcher).
> 
> 
> Thanks,
> 
> Aaron Wood
> 
>


Re: Review Request 54996: Fix SIGBUS crash on ARM64/AArch64.

Posted by Aaron Wood <aa...@verizon.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54996/
-----------------------------------------------------------

(Updated Jan. 4, 2017, 9:55 p.m.)


Review request for mesos and Jie Yu.


Changes
-------

`posix_memalign` does not set `errno`.


Bugs: MESOS-6835
    https://issues.apache.org/jira/browse/MESOS-6835


Repository: mesos


Description
-------

Currently in the Linux launcher when the stack is allocated and prepared for a call to clone() it is not properly aligned. This is not an issue for x86 or x64 but for ARM64/AArch64 it is because of the requirement of having the stack aligned to a 16 byte boundary. While x86 and x64 also expect the stack to have a 16 byte aligned stack, it is not enforced. An explanation of the stack and requirements for ARM64 can be found here http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf (specifically section 5.2.2.1 that says SP mod 16 = 0. The stack must be quad-word aligned.)

Additionally, the way that the stack is currently allocated and passed to clone() accidentally chops off one entry, making a stack overflow using those missing 8 bytes a possibility. Fixing this while aligning the memory will fix both the issue of the stack overflow issue as well as the SIGBUS crash. We should also net better performance from having the stack aligned.


Diffs (updated)
-----

  3rdparty/stout/include/stout/os/linux.hpp 530f1a55b 
  src/linux/ns.hpp 77789717e 

Diff: https://reviews.apache.org/r/54996/diff/


Testing
-------

Built Mesos from source and am currently running it in a test cluster. Launched both Docker and Mesos tasks via Marathon without any resulting crash (initial crash only happened with Mesos containerizer + linux_launcher, not with the posix_launcher).


Thanks,

Aaron Wood


Re: Review Request 54996: Fix SIGBUS crash on ARM64/AArch64.

Posted by Aaron Wood <aa...@verizon.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54996/
-----------------------------------------------------------

(Updated Jan. 4, 2017, 9:28 p.m.)


Review request for mesos and Jie Yu.


Bugs: MESOS-6835
    https://issues.apache.org/jira/browse/MESOS-6835


Repository: mesos


Description
-------

Currently in the Linux launcher when the stack is allocated and prepared for a call to clone() it is not properly aligned. This is not an issue for x86 or x64 but for ARM64/AArch64 it is because of the requirement of having the stack aligned to a 16 byte boundary. While x86 and x64 also expect the stack to have a 16 byte aligned stack, it is not enforced. An explanation of the stack and requirements for ARM64 can be found here http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf (specifically section 5.2.2.1 that says SP mod 16 = 0. The stack must be quad-word aligned.)

Additionally, the way that the stack is currently allocated and passed to clone() accidentally chops off one entry, making a stack overflow using those missing 8 bytes a possibility. Fixing this while aligning the memory will fix both the issue of the stack overflow issue as well as the SIGBUS crash. We should also net better performance from having the stack aligned.


Diffs (updated)
-----

  3rdparty/stout/include/stout/os/linux.hpp 530f1a55b 
  src/linux/ns.hpp 77789717e 

Diff: https://reviews.apache.org/r/54996/diff/


Testing
-------

Built Mesos from source and am currently running it in a test cluster. Launched both Docker and Mesos tasks via Marathon without any resulting crash (initial crash only happened with Mesos containerizer + linux_launcher, not with the posix_launcher).


Thanks,

Aaron Wood


Re: Review Request 54996: Fix SIGBUS crash on ARM64/AArch64.

Posted by Aaron Wood <aa...@verizon.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54996/#review160454
-----------------------------------------------------------




3rdparty/stout/include/stout/os/linux.hpp (line 68)
<https://reviews.apache.org/r/54996/#comment231598>

    Anyone see a way around this reinterpret_cast?


- Aaron Wood


On Jan. 4, 2017, 12:26 a.m., Aaron Wood wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54996/
> -----------------------------------------------------------
> 
> (Updated Jan. 4, 2017, 12:26 a.m.)
> 
> 
> Review request for mesos and Jie Yu.
> 
> 
> Bugs: MESOS-6835
>     https://issues.apache.org/jira/browse/MESOS-6835
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently in the Linux launcher when the stack is allocated and prepared for a call to clone() it is not properly aligned. This is not an issue for x86 or x64 but for ARM64/AArch64 it is because of the requirement of having the stack aligned to a 16 byte boundary. While x86 and x64 also expect the stack to have a 16 byte aligned stack, it is not enforced. An explanation of the stack and requirements for ARM64 can be found here http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf (specifically section 5.2.2.1 that says SP mod 16 = 0. The stack must be quad-word aligned.)
> 
> Additionally, the way that the stack is currently allocated and passed to clone() accidentally chops off one entry, making a stack overflow using those missing 8 bytes a possibility. Fixing this while aligning the memory will fix both the issue of the stack overflow issue as well as the SIGBUS crash. We should also net better performance from having the stack aligned.
> 
> 
> Diffs
> -----
> 
>   3rdparty/stout/include/stout/os/linux.hpp 530f1a55b 
>   src/linux/ns.hpp 77789717e 
> 
> Diff: https://reviews.apache.org/r/54996/diff/
> 
> 
> Testing
> -------
> 
> Built Mesos from source and am currently running it in a test cluster. Launched both Docker and Mesos tasks via Marathon without any resulting crash (initial crash only happened with Mesos containerizer + linux_launcher, not with the posix_launcher).
> 
> 
> Thanks,
> 
> Aaron Wood
> 
>