Posted to dev@slider.apache.org by Sarthak Kukreti <sk...@ncsu.edu> on 2016/07/07 18:50:19 UTC

[slider-agent] Different "base-dir" for STOP and STATUS commands

Hello!

I am trying to use Slider to distribute an application over a YARN
cluster. While attempting to use "slider flex" to decrease the number
of containers allocated for the application (using the kafka
app-package as reference), I came across the following error:

ERROR 2016-07-07 10:57:36,461 CustomServiceOrchestrator.py:169 -
Caught an exception while executing command: <class
'AgentException.AgentException'>: 'Script
/private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/container_1467829690678_0022_01_000003/app/definition/package/scripts/kafka.py
does not exist'
Traceback (most recent call last):
  File "/private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/filecache/71/slider-agent.tar.gz/slider-agent/agent/CustomServiceOrchestrator.py",
line 115, in runCommand
    script_path = self.resolve_script_path(self.base_dir, script, script_type)
  File "/private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/filecache/71/slider-agent.tar.gz/slider-agent/agent/CustomServiceOrchestrator.py",
line 199, in resolve_script_path
    raise AgentException(message)
AgentException: 'Script
/private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/container_1467829690678_0022_01_000003/app/definition/package/scripts/kafka.py
does not exist'


(It seems the directory is cleaned up before the command runs.)

Additionally, I added debug prints in CustomServiceOrchestrator to see
what base directory is used to invoke the script, and found that the
base directories for the STATUS and STOP commands differ:

STATUS command:

INFO 2016-07-07 10:56:31,323 AgentToggleLogger.py:40 - Adding
STATUS_COMMAND for service kc of cluster kc to the queue.
INFO 2016-07-07 10:56:31,327 CustomServiceOrchestrator.py:114 - Base
dir: /private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/filecache/113/slider-kafka-package-1.0.0.zip/package


STOP command:

INFO 2016-07-07 10:57:36,455 AgentToggleLogger.py:40 - Adding
EXECUTION_COMMAND for service kc of cluster kc to the queue.
INFO 2016-07-07 10:57:36,456 Controller.py:251 - Attempting to
gracefully stop the application ...
INFO 2016-07-07 10:57:36,458 ActionQueue.py:134 - Package received:
INFO 2016-07-07 10:57:36,458 ActionQueue.py:140 - Executing command
with id = 4-1 for role = Hello of cluster kc
INFO 2016-07-07 10:57:36,460 ActionQueue.py:170 - Running command:
{u'roleCommand': u'STOP', u'clusterName': u'kc', u'componentName':
u'Hello', u'hostname': u'192.168.1.195', u'hostLevelParams':
{u'java_home': u'/Library/Java/JavaVirtualMachines/jdk1.7.0_79.jdk/Contents/Home/',
u'container_id': u'container_1467829690678_0022_01_000003'},
u'commandType': u'EXECUTION_COMMAND', u'roleParams': {u'auto_restart':
u'false'}, u'serviceName': u'kc', u'role': u'Hello', u'commandParams':
{u'record_config': u'true', u'service_package_folder':
u'${AGENT_WORK_ROOT}/work/app/definition/package', u'script':
u'scripts/kafka.py', u'schema_version': u'2.0', u'command_timeout':
u'600', u'script_type': u'PYTHON'}, u'taskId': 4, u'commandId':
u'4-1', u'containers': [], u'configurations': {u'global':
{u'security_enabled': u'false', u'app_container_id':
u'container_1467829690678_0022_01_000003', u'listen_port': u'52508',
u'app_root': u'${AGENT_WORK_ROOT}/app/install', u'app_log_dir':
u'${AGENT_LOG_ROOT}', u'kc_version': u'1.0.0', u'app_pid_dir':
u'${AGENT_WORK_ROOT}/app/run', u'app_container_tag': u'2',
u'pid_file': u'${AGENT_WORK_ROOT}/app/run/kc.pid', u'app_install_dir':
u'${AGENT_WORK_ROOT}/app/install', u'app_user': u'sarthakk',
u'app_input_conf_dir': u'${AGENT_WORK_ROOT}/propagatedconf'},
u'server': {}}}
INFO 2016-07-07 10:57:36,461 CustomServiceOrchestrator.py:114 - Base
dir: /private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/container_1467829690678_0022_01_000003/app/definition/package

For some reason, the STOP command attempts to pick up the script from
the container-specific location, whereas the STATUS command goes through
an entirely different path (though I am not sure whether this is the
cause of the issue). Any more pointers for debugging this would be
really helpful.
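
(For anyone reproducing this: the debug print I added is roughly the
following, placed near the top of runCommand() in the agent's
CustomServiceOrchestrator.py. It is a local sketch, not stock agent
code, and the exact placement may differ across versions.)

# Local debug addition (not in stock slider-agent); assumes the
# module's existing "logger" and "import os".
logger.info("Base dir: " + self.base_dir)
logger.info("Base dir exists: %s" % os.path.exists(self.base_dir))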

(For reference, platform: OS X, Python 2.7.11)

Thank you
Sarthak

Re: [slider-agent] Different "base-dir" for STOP and STATUS commands

Posted by Gour Saha <gs...@hortonworks.com>.
Is this a multi-node cluster?

If yes, can you check the time across all the nodes and make sure they are
in sync?

If not, did you override any timeout properties via the app config or
resources? If you could share the JSON files you used to start the app, it
would help to debug further.
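
While you gather those, a quick way to spot any overridden timeout
properties in the JSON files (plain Python; the exact property names
vary by app, so this just matches key paths containing "timeout"):

import json

def find_timeouts(path):
    # Walk appConfig.json / resources.json and print any property whose
    # key path mentions "timeout", however deeply it is nested.
    with open(path) as f:
        conf = json.load(f)
    def walk(node, prefix=""):
        if isinstance(node, dict):
            for key, value in node.items():
                walk(value, prefix + "/" + key)
        elif isinstance(node, list):
            for i, value in enumerate(node):
                walk(value, "%s[%d]" % (prefix, i))
        elif "timeout" in prefix.lower():
            print("%s = %s" % (prefix, node))
    walk(conf)

find_timeouts("appConfig.json")   # adjust to wherever your files live
find_timeouts("resources.json")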

-Gour

Re: [slider-agent] Different "base-dir" for STOP and STATUS commands

Posted by Billie Rinaldi <bi...@gmail.com>.
Ah, I think this is happening because you're using OS X. Normally YARN
kills the entire process group (including child processes) when the
container is killed, but it cannot do this on OS X, so the agent process
lives longer than it should.

A question we may want to look into is why the AM does not wait for the
stop command to be executed before releasing the container.
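
For illustration, the difference is roughly the following (a standalone
sketch on my part, not the actual NodeManager code):

import os
import signal
import subprocess
import time

# Launch a shell in its own process group; the shell spawns a child
# (the "sleep" stands in for the long-lived agent process).
proc = subprocess.Popen(["/bin/sh", "-c", "sleep 300 & wait"],
                        preexec_fn=os.setsid)
time.sleep(1)

# What YARN normally does: signal the whole group, so the child dies
# along with the shell.
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)

# Where group signaling is unavailable, only the direct child can be
# signaled and the grandchild keeps running:
#   os.kill(proc.pid, signal.SIGTERM)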

Billie

Re: [slider-agent] Different "base-dir" for STOP and STATUS commands

Posted by Sarthak Kukreti <sk...@ncsu.edu>.
Thanks! That was helpful. Strangely, it turns out that the container is
released (and cleaned up) even before the STOP command is queued. Some
more logs:

Node Manager:
---------------------
2016-07-07 15:50:14,148 [AmExecutor-006] INFO  state.AppState - Role
ConnectD flexed from 2 to 1
2016-07-07 15:50:14,148 [AmExecutor-006] WARN  state.AppState -
Resource requirements of ConnectD normalized from <memory:2000,
vCores:1> to <memory:2048, vCores:1>
2016-07-07 15:50:14,148 [AmExecutor-006] INFO  state.AppState -
Resetting failure count of slider-appmaster; was 0
2016-07-07 15:50:14,148 [AmExecutor-006] INFO  state.AppState -
Resetting failure count of ConnectD; was 0
2016-07-07 15:50:14,148 [AmExecutor-006] INFO  state.RoleHistory -
Resetting failure history
2016-07-07 15:50:14,148 [AmExecutor-006] INFO  state.AppState -
Reviewing RoleStatus{name='ConnectD', key=1, desired=1, actual=2,
requested=0, releasing=0, failed=0, startFailed=0, started=2,
completed=0, totalRequested=2, preempted=0, nodeFailed=0,
failedRecently=0, limitsExceeded=0, resourceRequirements=<memory:2048,
vCores:1>, isAntiAffinePlacement=false, failureMessage='',
providerRole=ProviderRole{name='ConnectD', id=1, placementPolicy=1,
nodeFailureThreshold=3, placementTimeoutSeconds=30,
labelExpression='null'}} :
2016-07-07 15:50:14,148 [AmExecutor-006] INFO  state.AppState -
ConnectD: Asking for 1 fewer node(s) for a total of 1
2016-07-07 15:50:14,150 [AmExecutor-006] INFO  state.AppState -
Targeting for release: RoleInstance{role='ConnectD',
id='container_1467829690678_0031_01_000003',
container=ContainerID=container_1467829690678_0031_01_000003
nodeID=192.168.1.195:58591 http=192.168.1.195:8042 priority=1073741825
resource=<memory:2048, vCores:1>, createTime=1467931727650,
startTime=1467931727656, released=false, roleId=1, host=192.168.1.195,
hostURL=http://192.168.1.195:8042, state=3, placement=null,
exitCode=0, command='python ./infra/agent/slider-agent/agent/main.py
--label container_1467829690678_0031_01_000003___ConnectD --zk-quorum
localhost:2181 --zk-reg-path
/registry/users/sarthakk/services/org-apache-slider/kc >
<LOG_DIR>/slider-agent.out 2>&1 ; ', diagnostics='null', output=null,
environment=[LANGUAGE="en_US.UTF-8", HADOOP_USER_NAME="sarthakk",
PYTHONPATH="./infra/agent/slider-agent/", AGENT_LOG_ROOT="<LOG_DIR>",
SLIDER_PASSPHRASE="QhURcbCISl0zGwdgkbpbIBKAlIJOmsgKkRgmvuq5ytVDlguFa0",
LC_ALL="en_US.UTF-8", AGENT_WORK_ROOT="$PWD", LANG="en_US.UTF-8"]}

2016-07-07 15:50:14,975 [AMRM Callback Handler Thread] INFO
appmaster.SliderAppMaster - onContainersCompleted([1]
2016-07-07 15:50:14,976 [AMRM Callback Handler Thread] INFO
appmaster.SliderAppMaster - Container Completion for
containerID=container_1467829690678_0031_01_000003, state=COMPLETE,
exitStatus=-100, diagnostics=Container released by application
2016-07-07 15:50:14,976 [AMRM Callback Handler Thread] INFO
state.AppState - Container was queued for release :
container_1467829690678_0031_01_000003
2016-07-07 15:50:14,976 [AMRM Callback Handler Thread] INFO
state.AppState - decrementing role count for role ConnectD to 1;
releasing=0, completed=1
2016-07-07 15:50:14,976 [AMRM Callback Handler Thread] INFO
state.RoleHistory - Finished container for node 1, released=true,
shortlived=false
2016-07-07 15:50:14,987 [AMRM Callback Handler Thread] INFO
state.AppState - Removing node ID
container_1467829690678_0031_01_000003
2016-07-07 15:50:14,988 [AMRM Callback Handler Thread] INFO
agent.AgentProviderService - Removing component status for label
container_1467829690678_0031_01_000003___ConnectD

2016-07-07 15:50:14,988 [AmExecutor-006] INFO
appmaster.SliderAppMaster - Unregistering component
container_1467829690678_0031_01_000003
2016-07-07 15:50:14,989 [AmExecutor-006] INFO  state.AppState -
Reviewing RoleStatus{name='ConnectD', key=1, desired=1, actual=1,
requested=0, releasing=0, failed=0, startFailed=0, started=2,
completed=1, totalRequested=2, preempted=0, nodeFailed=0,
failedRecently=0, limitsExceeded=0, resourceRequirements=<memory:2048,
vCores:1>, isAntiAffinePlacement=false, failureMessage='',
providerRole=ProviderRole{name='ConnectD', id=1, placementPolicy=1,
nodeFailureThreshold=3, placementTimeoutSeconds=30,
labelExpression='null'}} :
2016-07-07 15:50:23,715 [430377746@qtp-1726515679-4] INFO
agent.AgentProviderService - Sending terminate signal to completed
container (still heartbeating):
container_1467829690678_0031_01_000003___ConnectD
2016-07-07 16:12:43,891 [AmExecutor-006] INFO  state.AppState -
Resetting failure count of slider-appmaster; was 0
2016-07-07 16:12:43,892 [AmExecutor-006] INFO  state.AppState -
Resetting failure count of ConnectD; was 0
2016-07-07 16:12:43,892 [AmExecutor-006] INFO  state.RoleHistory -
Resetting failure history


slider-agent Logs:
---------------------
INFO 2016-07-07 15:50:23,720 ActionQueue.py:170 - Running command:
{u'roleCommand': u'STOP', u'clusterName': u'kc', u'componentName':
u'ConnectD', u'hostname': u'192.168.1.195', u'hostLevelParams':
{u'java_home': u'${JAVA_HOME}', u'container_id':
u'container_1467829690678_0031_01_000003'}, u'commandType':
u'EXECUTION_COMMAND', u'roleParams': {u'auto_restart': u'false'},
u'serviceName': u'kc', u'role': u'ConnectD', u'commandParams':
{u'record_config': u'true', u'service_package_folder':
u'${AGENT_WORK_ROOT}/work/app/definition/package', u'script':
u'scripts/kafka.py', u'schema_version': u'2.0', u'command_timeout':
u'600', u'script_type': u'PYTHON'}, u'taskId': 4, u'commandId':
u'4-1', u'containers': [], u'configurations': {u'global':
{u'security_enabled': u'false', u'app_container_id':
u'container_1467829690678_0031_01_000003', u'listen_port': u'51670',
u'app_root': u'${AGENT_WORK_ROOT}/app/install', u'app_log_dir':
u'${AGENT_LOG_ROOT}', u'kc_version': u'1.0.0', u'app_pid_dir':
u'${AGENT_WORK_ROOT}/app/run', u'app_container_tag': u'2',
u'pid_file': u'${AGENT_WORK_ROOT}/app/run/kc.pid', u'app_install_dir':
u'${AGENT_WORK_ROOT}/app/install', u'app_user': u'sarthakk',
u'app_input_conf_dir': u'${AGENT_WORK_ROOT}/propagatedconf'},
u'server': {}}}
INFO 2016-07-07 15:50:23,720 CustomServiceOrchestrator.py:114 - Base
dir: /private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0031/container_1467829690678_0031_01_000003/app/definition/package

ERROR 2016-07-07 15:50:23,720 CustomServiceOrchestrator.py:169 -
Caught an exception while executing command: <class
'AgentException.AgentException'>: 'Script
/private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0031/container_1467829690678_0031_01_000003/app/definition/package/scripts/kafka.py
does not exist'
Traceback (most recent call last):
  File "/private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0031/filecache/71/slider-agent.tar.gz/slider-agent/agent/CustomServiceOrchestrator.py",
line 115, in runCommand
    script_path = self.resolve_script_path(self.base_dir, script, script_type)
  File "/private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0031/filecache/71/slider-agent.tar.gz/slider-agent/agent/CustomServiceOrchestrator.py",
line 199, in resolve_script_path
    raise AgentException(message)
AgentException: 'Script
/private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0031/container_1467829690678_0031_01_000003/app/definition/package/scripts/kafka.py
does not exist'
INFO 2016-07-07 15:50:23,721 ActionQueue.py:188 - Stop command received
INFO 2016-07-07 15:50:23,824 Controller.py:223 - Terminate agent
command received from AM, stopping the agent ...
INFO 2016-07-07 15:50:23,824 ProcessHelper.py:39 - Removing pid file
WARNING 2016-07-07 15:50:23,825 ProcessHelper.py:44 - Unable to remove
pid file: [Errno 2] No such file or directory:
'/private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0031/container_1467829690678_0031_01_000003/infra/run/agent.pid'
INFO 2016-07-07 15:50:23,825 ProcessHelper.py:46 - Removing temp files

TL;DR: The callback signaling completion (and unregistration) of the
container arrives around 15:50:14, whereas the call to
CustomServiceOrchestrator (triggered by the terminate signal from the
Slider AM) is received around 15:50:23. I guess I am missing how the
termination of a container is synchronized with the cleanup of its
resources in Slider (or whether this is handled by YARN itself).
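
(As a local band-aid I am experimenting with a guard along these lines
before the resolve_script_path() call in my copy of the agent; this is
just a sketch, not a proposed fix:)

import logging
import os

logger = logging.getLogger(__name__)

def base_dir_still_exists(base_dir, script):
    """Return False (and log) if YARN has already removed the container
    directory, so a late STOP can be skipped instead of raising
    AgentException."""
    if os.path.isdir(base_dir):
        return True
    logger.warning("Base dir %s is gone; skipping script %s",
                   base_dir, script)
    return False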

- Sarthak

Re: [slider-agent] Different "base-dir" for STOP and STATUS commands

Posted by Billie Rinaldi <bi...@gmail.com>.
If you look for the container ID in the nodemanager log on the host where
the container was running, you should be able to see when the container
stopped and was cleaned up. Looks like it even logs when it deletes the
container directories.
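
Something like this will pull the relevant lines out of the NM log
(plain Python; the log file path here is a placeholder, adjust for
your install):

container_id = "container_1467829690678_0022_01_000003"

# Print every nodemanager log line mentioning the container; that shows
# when it stopped and when its directories were deleted.
with open("/path/to/hadoop-nodemanager.log") as log:  # placeholder path
    for line in log:
        if container_id in line:
            print(line.rstrip())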

Re: [slider-agent] Different "base-dir" for STOP and STATUS commands

Posted by Sarthak Kukreti <sk...@ncsu.edu>.
kafka.py is still present in the filecache directory; it's just the
"container_1467829690678_0022_01_000003" directory that seems to be
deleted before the runCommand() call.
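
For reference, this is roughly the check I ran on the NodeManager host
(a quick sketch; the two paths are the ones from the logs, with the
script path appended to each base dir):

import os

app_dir = ("/private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/"
           "application_1467829690678_0022")

# Localized app package in the filecache: kafka.py is still here.
in_filecache = os.path.join(
    app_dir,
    "filecache/113/slider-kafka-package-1.0.0.zip/package/scripts/kafka.py")

# Container-specific work dir: gone by the time runCommand() fires.
in_container = os.path.join(
    app_dir,
    "container_1467829690678_0022_01_000003/app/definition/package/"
    "scripts/kafka.py")

for path in (in_filecache, in_container):
    print("%s -> exists: %s" % (path, os.path.exists(path)))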

- Sarthak

On Thu, Jul 7, 2016 at 12:35 PM, Billie Rinaldi
<bi...@gmail.com> wrote:
> I think that
> /private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/container_1467829690678_0022_01_000003/app/definition
> is linked to
> /private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/filecache/113/slider-kafka-package-1.0.0.zip,
> so those are actually the same directory. I am not sure why it is saying
> kafka.py does not exist when stopping the container; it definitely should
> not clean up that directory while a container is still running. Can you
> verify that app/definition/package/scripts/kafka.py exists for one of the
> containers that is running?

Re: [slider-agent] Different "base-dir" for STOP and STATUS commands

Posted by Billie Rinaldi <bi...@gmail.com>.
I think that
/private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/container_1467829690678_0022_01_000003/app/definition
is linked to
/private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/application_1467829690678_0022/filecache/113/slider-kafka-package-1.0.0.zip,
so those are actually the same directory. I am not sure why it is saying
kafka.py does not exist when stopping the container; it definitely should
not clean up that directory while a container is still running. Can you
verify that app/definition/package/scripts/kafka.py exists for one of the
containers that is running?
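
Something like this quick sketch, run on the NodeManager host, should
show whether the link is intact (the paths are the ones from your logs;
nothing Slider-specific, just the stdlib):

import os

definition = ("/private/tmp/hdfs/nm-local-dir/usercache/sarthakk/appcache/"
              "application_1467829690678_0022/"
              "container_1467829690678_0022_01_000003/app/definition")

# If localization set the container up normally, this should resolve
# into the filecache copy of slider-kafka-package-1.0.0.zip.
print("islink:   %s" % os.path.islink(definition))
print("realpath: %s" % os.path.realpath(definition))
print("script:   %s" % os.path.exists(
    os.path.join(definition, "package/scripts/kafka.py")))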
