You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@aurora.apache.org by "Jay Buffington (JIRA)" <ji...@apache.org> on 2014/07/18 04:28:05 UTC
[jira] [Updated] (AURORA-589) using {{thermos.ports[foo_{{mesos.instance}}]}} creates task that can never start

     [ https://issues.apache.org/jira/browse/AURORA-589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jay Buffington updated AURORA-589:
----------------------------------

    Description: 
Today a user wrote a .aurora file that had this command line:
  
{noformat}  
    'echo {{thermos.ports[foo_{{mesos.instance}}]}}; sleep 300',
{noformat}

It caused thermos_runner to fail to start.  __main__.log had this content:

{noformat}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0717 22:12:44.033905 25696 fetcher.cpp:76] Fetching URI '/usr/local/bin/thermos_executor'
I0717 22:12:44.034059 25696 fetcher.cpp:179] Copying resource from '/usr/local/bin/thermos_executor' to '/tmp/mesos/slaves/20140623-183547-1749004561-5050-30136-42/frameworks/20140522-213145-1749004561-5050-29512-0000/executors/thermos-1405635163935-jaybuff-test-thermos_bug-0-386f11a7-3998-460b-ade7-7293cc75a860/runs/f72554fd-e6a0-46c7-8d09-989fd0b4c399'
twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
Writing log files to disk in .
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0717 22:12:45.586942 25695 exec.cpp:131] Version: 0.18.0
I0717 22:12:45.590029 25716 exec.cpp:205] Executor registered on slave 20140623-183547-1749004561-5050-30136-42
Writing log files to disk in .
thermos_runner.pex: error: ERROR!  Unbound ports: foo_0
ERROR] Task did not start with in deadline, forcing loss.
FATAL] Task initialization failed: Task did not start within deadline.
twitter.common.app debug: Shutting application down.
twitter.common.app debug: Running exit function for twitter.common.log (Logging subsystem.)
twitter.common.app debug: Finishing up module teardown.
twitter.common.app debug:   Active thread: <_MainThread(MainThread, started 139823719085824)>
twitter.common.app debug:   Active thread (daemon): <_DummyThread(Dummy-2, started daemon 139823370553088)>
twitter.common.app debug: Exiting cleanly.
{noformat}

The aurora scheduler web ui shows the task failing over and over with each completed task showing this state transition:

{noformat}
07/17 5:29:45 local - PENDING - Rescheduled
07/17 5:29:45 local - ASSIGNED
07/17 5:29:46 local - STARTING - Initializing sandbox.
07/17 5:30:46 local - FAILED - Task initialization failed: Task did not start within deadline
{noformat}

I'm still discussing with my user if he has a valid use case for this or not (amazingly, I think he might).  Either way, the client knows that there is a port in the cmdline and it's not setting requestedPorts, so the client should fail before sending the request to the scheduler the runner will never be able to start.

I've told him to work around it by writing a python loop that iterates over all the instances and add {noformat}echo  {{thermos.ports[foo_%d]}}{noformat} to a cmdline

  was:
Today a user wrote a .aurora file that had this command line:
  
{noformat}  
    'echo {{thermos.ports[foo_{{mesos.instance}}]}}; sleep 300',
{noformat}

It caused thermos_runner to fail to start.  __main__.log had this content:

{noformat}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0717 22:12:44.033905 25696 fetcher.cpp:76] Fetching URI '/usr/local/bin/thermos_executor'
I0717 22:12:44.034059 25696 fetcher.cpp:179] Copying resource from '/usr/local/bin/thermos_executor' to '/tmp/mesos/slaves/20140623-183547-1749004561-5050-30136-42/frameworks/20140522-213145-1749004561-5050-29512-0000/executors/thermos-1405635163935-jaybuff-test-thermos_bug-0-386f11a7-3998-460b-ade7-7293cc75a860/runs/f72554fd-e6a0-46c7-8d09-989fd0b4c399'
twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
Writing log files to disk in .
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0717 22:12:45.586942 25695 exec.cpp:131] Version: 0.18.0
I0717 22:12:45.590029 25716 exec.cpp:205] Executor registered on slave 20140623-183547-1749004561-5050-30136-42
Writing log files to disk in .
thermos_runner.pex: error: ERROR!  Unbound ports: foo_0
ERROR] Task did not start with in deadline, forcing loss.
FATAL] Task initialization failed: Task did not start within deadline.
twitter.common.app debug: Shutting application down.
twitter.common.app debug: Running exit function for twitter.common.log (Logging subsystem.)
twitter.common.app debug: Finishing up module teardown.
twitter.common.app debug:   Active thread: <_MainThread(MainThread, started 139823719085824)>
twitter.common.app debug:   Active thread (daemon): <_DummyThread(Dummy-2, started daemon 139823370553088)>
twitter.common.app debug: Exiting cleanly.
{noformat}

The aurora scheduler web ui shows the task failing over and over with each completed task showing this state transition:

{noformat}
07/17 5:29:45 local - PENDING - Rescheduled
07/17 5:29:45 local - ASSIGNED
07/17 5:29:46 local - STARTING - Initializing sandbox.
07/17 5:30:46 local - FAILED - Task initialization failed: Task did not start within deadline
{noformat}

I'm still discussing with my user if he has a valid use case for this or not (amazingly, I think he might).  Either way, the client knows that there is a port in the cmdline and it's not setting requestedPorts, so the client should fail before sending the request to the scheduler the runner will never be able to start.

I've told him to work around it by writing a python loop that iterates over all the instances and prints out "echo  {{thermos.ports[foo_%d]}}"


> using {{thermos.ports[foo_{{mesos.instance}}]}} creates task that can never start
> ---------------------------------------------------------------------------------
>
>                 Key: AURORA-589
>                 URL: https://issues.apache.org/jira/browse/AURORA-589
>             Project: Aurora
>          Issue Type: Bug
>            Reporter: Jay Buffington
>            Priority: Critical
>
> Today a user wrote a .aurora file that had this command line:
>   
> {noformat}  
>     'echo {{thermos.ports[foo_{{mesos.instance}}]}}; sleep 300',
> {noformat}
> It caused thermos_runner to fail to start.  __main__.log had this content:
> {noformat}
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0717 22:12:44.033905 25696 fetcher.cpp:76] Fetching URI '/usr/local/bin/thermos_executor'
> I0717 22:12:44.034059 25696 fetcher.cpp:179] Copying resource from '/usr/local/bin/thermos_executor' to '/tmp/mesos/slaves/20140623-183547-1749004561-5050-30136-42/frameworks/20140522-213145-1749004561-5050-29512-0000/executors/thermos-1405635163935-jaybuff-test-thermos_bug-0-386f11a7-3998-460b-ade7-7293cc75a860/runs/f72554fd-e6a0-46c7-8d09-989fd0b4c399'
> twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
> Writing log files to disk in .
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0717 22:12:45.586942 25695 exec.cpp:131] Version: 0.18.0
> I0717 22:12:45.590029 25716 exec.cpp:205] Executor registered on slave 20140623-183547-1749004561-5050-30136-42
> Writing log files to disk in .
> thermos_runner.pex: error: ERROR!  Unbound ports: foo_0
> ERROR] Task did not start with in deadline, forcing loss.
> FATAL] Task initialization failed: Task did not start within deadline.
> twitter.common.app debug: Shutting application down.
> twitter.common.app debug: Running exit function for twitter.common.log (Logging subsystem.)
> twitter.common.app debug: Finishing up module teardown.
> twitter.common.app debug:   Active thread: <_MainThread(MainThread, started 139823719085824)>
> twitter.common.app debug:   Active thread (daemon): <_DummyThread(Dummy-2, started daemon 139823370553088)>
> twitter.common.app debug: Exiting cleanly.
> {noformat}
> The aurora scheduler web ui shows the task failing over and over with each completed task showing this state transition:
> {noformat}
> 07/17 5:29:45 local - PENDING - Rescheduled
> 07/17 5:29:45 local - ASSIGNED
> 07/17 5:29:46 local - STARTING - Initializing sandbox.
> 07/17 5:30:46 local - FAILED - Task initialization failed: Task did not start within deadline
> {noformat}
> I'm still discussing with my user if he has a valid use case for this or not (amazingly, I think he might).  Either way, the client knows that there is a port in the cmdline and it's not setting requestedPorts, so the client should fail before sending the request to the scheduler the runner will never be able to start.
> I've told him to work around it by writing a python loop that iterates over all the instances and add {noformat}echo  {{thermos.ports[foo_%d]}}{noformat} to a cmdline



--
This message was sent by Atlassian JIRA
(v6.2#6252)