Posted to issues@mesos.apache.org by "Benjamin Mahler (JIRA)" <ji...@apache.org> on 2017/11/13 22:55:00 UTC

[jira] [Comment Edited] (MESOS-8129) Very large resource value crashes master

    [ https://issues.apache.org/jira/browse/MESOS-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250290#comment-16250290 ] 

Benjamin Mahler edited comment on MESOS-8129 at 11/13/17 10:54 PM:
-------------------------------------------------------------------

Thanks, Bruce, for the well-written ticket. Determining the largest safe scalar would take me some time, since the fractional component complicates matters compared with integers stored in doubles (i.e. the case where we can just use Number.MAX_SAFE_INTEGER).
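
As a rough illustration of why the fractional component matters, the gap between adjacent doubles can be compared against the 0.001 granularity mentioned in the ticket. This is a minimal Python sketch, not Mesos code: the helper names are hypothetical, `math.ulp` needs Python 3.9 or later, and the check only looks at double spacing, so it should not be read as the verified safe bound.

```python
import math

# Granularity cited in the ticket for the pseudo-fixed-point arithmetic.
PRECISION = 0.001


def double_spacing(value: float) -> float:
    """Return the gap between `value` and the next larger double."""
    return math.ulp(value)


def spacing_exceeds_precision(value: float, precision: float = PRECISION) -> bool:
    """True when adjacent doubles near `value` are further apart than `precision`."""
    return double_spacing(value) > precision


if __name__ == "__main__":
    capacity = 4294967295000000.0  # capacity from the ticket
    print(double_spacing(capacity))             # 0.5
    print(spacing_exceeds_precision(capacity))  # True
    print(spacing_exceeds_precision(1e12))      # False
```

With doubles spaced 0.5 apart near the reported capacity, a single rounding step can be off by up to 0.25, which is consistent with the error of about 0.2 described in the ticket.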

Also, I'm curious what your use case is. Can you tell me?


was (Author: bmahler):
Thanks Bruce for the well written ticket. Determining it more formally would take me some time given the fractional component complicates matters over integers in doubles (I.e. just use Number.MAX_SAFE_INTEGER).

Also, I'm curious what your use case is, can you tell me?

> Very large resource value crashes master
> ----------------------------------------
>
>                 Key: MESOS-8129
>                 URL: https://issues.apache.org/jira/browse/MESOS-8129
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, master
>    Affects Versions: 1.4.0
>         Environment: Ubuntu 14.04
> Both apt packages from Mesosphere repo and Docker images
>            Reporter: Bruce Merry
>            Assignee: Benjamin Mahler
>            Priority: Minor
>
> I ran into a master that kept failing on this CHECK when destroying a task:
> https://github.com/apache/mesos/blob/1.4.0/src/master/allocator/sorter/drf/sorter.hpp#L367
> I found that a combination of a misconfiguration and a suboptimal choice of units had led to an agent with a custom scalar resource of capacity 4294967295000000. I believe the pseudo-fixed-point arithmetic isn't able to cope with such large numbers, because rounding errors after arithmetic are bigger than 0.001. Examining the values in the debugger showed that the CHECK failed due to a rounding error on the order of 0.2.
> While this is probably a fundamental limitation of the fixed-point implementation and such large resource values are probably a bad idea, it would have helped if the agent had complained on startup, rather than having to debug an internal assertion failure. I'd suggest that values larger than, say, 10^12 should be rejected when the agent starts (which is why I've added the agent component), although someone familiar with the details of the fixed-point implementation should probably verify that number.
> I'm not sure where this needs to be fixed, e.g. whether it can just be validated on agent startup or whether it should be baked into the Resource class to prevent accidents in requests from the user.
> To reproduce the issue, start a master and an agent with a custom scalar resource "thing:4294967295000000", then use mesos-execute to throw the following task at it (it'll probably also work with a smaller Docker image - that's just one I already had on the agent). When the sleep ends, the master crashes.
> {code:javascript}
> {
>   "container": {
>     "docker": {
>       "image": "ubuntu:xenial-20161010"
>     }, 
>     "type": "DOCKER"
>   }, 
>   "name": "test-task", 
>   "task_id": {
>     "value": "00000001"
>   }, 
>   "command": {
>     "shell": false, 
>     "value": "sleep", 
>     "arguments": [
>       "10"
>     ]
>   }, 
>   "agent_id": {
>     "value": ""
>   }, 
>   "resources": [
>     {
>       "scalar": {
>         "value": 1
>       }, 
>       "type": "SCALAR", 
>       "name": "cpus"
>     }, 
>     {
>       "scalar": {
>         "value": 4106.0
>       }, 
>       "type": "SCALAR", 
>       "name": "mem"
>     }, 
>     {
>       "scalar": {
>         "value": 12465430.06012024
>       }, 
>       "type": "SCALAR", 
>       "name": "thing"
>     }
>   ]
> }
> {code}
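
The startup validation suggested in the description could look roughly like the following Python sketch. It is not Mesos code: the function name and error message are hypothetical, the `name:value` form is taken from the reproduction steps, and the 10^12 limit is the unverified figure floated in the ticket.

```python
# Threshold suggested in the ticket; the reporter notes it still needs to be
# verified against the fixed-point implementation.
MAX_SCALAR = 10**12


def check_scalar_resource(spec: str, limit: float = MAX_SCALAR):
    """Parse a 'name:value' scalar resource and reject values above `limit`."""
    name, sep, raw = spec.partition(":")
    if not sep or not name:
        raise ValueError(f"expected 'name:value', got {spec!r}")
    value = float(raw)
    if value > limit:
        raise ValueError(
            f"scalar resource {name!r} has value {value:.0f}, "
            f"which exceeds the limit of {limit:.0f}"
        )
    return name, value


if __name__ == "__main__":
    print(check_scalar_resource("thing:1000"))
    try:
        check_scalar_resource("thing:4294967295000000")
    except ValueError as error:
        print(f"rejected: {error}")
```

Failing at this point would surface the misconfiguration when the agent starts, rather than as an internal assertion failure in the master later on.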


