Posted to yarn-dev@hadoop.apache.org by Robert Grandl <rg...@yahoo.com.INVALID> on 2018/11/07 05:00:13 UTC

Run Distributed TensorFlow on YARN

 Hi all,
I am wondering if there is any stable support to run distributed TensorFlow atop YARN at the moment. 
I found this blog post from Hortonworks. It seems this is possible starting with YARN 3.1.0:
https://hortonworks.com/blog/distributed-tensorflow-assembly-hadoop-yarn/


I also found some more recent JIRAs:
https://issues.apache.org/jira/browse/YARN-8220
https://issues.apache.org/jira/browse/YARN-8135
which suggest using something called Submarine.

However, I could not find any proper documentation or instructions to use any of these.

Can someone help me with this? 
Otherwise, is there better support for running any other machine learning framework on YARN?
Thank you in advance,
- Robert

Re: Run Distributed TensorFlow on YARN

Posted by Wangda Tan <wh...@gmail.com>.
Forgot to add Xun in my last email.

On Thu, Nov 8, 2018 at 11:55 AM Wangda Tan <wh...@gmail.com> wrote:

> Hi Robert,
>
> Submarine in 3.2.0 only support Docker container runtime, and in future
> releases (maybe 3.2.1), we plan to add support for non-docker containers.
>
> In order to try Submarine, you need to properly configure docker-on-yarn
> first.
>
> You can check
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/InstallationScriptEN.md
> for installation guide about how to properly setup Docker container on
> multiple containers. Submarine embedded an interactive shell to help you
> set up this should be straightforward. Added Xun Liu who is the original
> author for the installation interactive shell.
>
> Once you get Docker on YARN properly set up, you can follow
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/QuickStart.md
> to run the first application.
>
> Also, you can check Submarine slides to better understand how it works.
> See: https://www.dropbox.com/s/wuv19b3rt9k2kq6/submarine-v0.pptx?dl=0
>
> Any questions please don't hesitate to let us know.
>
> Thanks,
> Wangda
>
>
>
> On Thu, Nov 8, 2018 at 10:12 AM Robert Grandl <rg...@yahoo.com.invalid>
> wrote:
>
>>  Thanks a lot for your reply.
>> Sunil,
>> I was trying to follow the steps from:
>> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/RunningDistributedCifar10TFJobs.md
>>
>> to run the tensorflow standalone using submarine. I have installed hadoop
>> 3.3.0-SNAPSHOT.
>> However, when I run the:yarn jar
>> path/to/hadoop-yarn-applications-submarine-3.2.0-SNAPSHOT.jar \
>>    job run --name tf-job-001 --verbose --docker_image
>> hadoopsubmarine/tf-1.8.0-gpu:0.0.1 \
>>    --input_path hdfs://default/dataset/cifar-10-data \
>>    --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
>>    --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0
>>    --num_workers 1 --worker_resources memory=8G,vcores=2,gpu=1 \
>>    --worker_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator
>> && python cifar10_main.py --data-dir=%input_path%
>> --job-dir=%checkpoint_path% --train-steps=10000 --eval-batch-size=16
>> --train-batch-size=16 --num-gpus=2 --sync" \
>>    --tensorboard --tensorboard_docker_image wtan/tf-1.8.0-cpu:0.0.3
>> command, I get the following error:2018-11-07 21:48:55,831 INFO  [main]
>> client.AHSProxy (AHSProxy.java:createAHSProxy(42)) - Connecting to
>> Application History server at /128.105.144.236:10200Exception in thread
>> "main" java.lang.IllegalArgumentException: Unacceptable no of cpus
>> specified, either zero or negative for component master (or at the global
>> level)        at
>> org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateServiceResource(ServiceApiUtil.java:457)
>>       at
>> org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateComponent(ServiceApiUtil.java:306)
>>       at
>> org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateAndResolveService(ServiceApiUtil.java:237)
>>       at
>> org.apache.hadoop.yarn.service.client.ServiceClient.actionCreate(ServiceClient.java:496)
>>       at
>> org.apache.hadoop.yarn.submarine.runtimes.yarnservice.YarnServiceJobSubmitter.submitJob(YarnServiceJobSubmitter.java:542)
>>       at
>> org.apache.hadoop.yarn.submarine.client.cli.RunJobCli.run(RunJobCli.java:231)
>>       at org.apache.hadoop.yarn.submarine.client.cli.Cli.main(Cli.java:94)
>>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>       at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>          at java.lang.reflect.Method.invoke(Method.java:498)        at
>> org.apache.hadoop.util.RunJar.run(RunJar.java:323)        at
>> org.apache.hadoop.util.RunJar.main(RunJar.java:236)
>>
>> It seems that I don't configure somewhere some corresponding resources
>> for a master component. However I have a hard time understanding where and
>> what to configure. I also looked at the design document you pointed at:
>> https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7
>>
>> and it has a --master_resources flag. However this is not available in
>> 3.3.0.
>> Could you please advise how to proceed with this?
>> Thank you,- Robert
>>
>>     On Tuesday, November 6, 2018, 10:40:20 PM PST, Jonathan Hung <
>> jyhung2357@gmail.com> wrote:
>>
>>  Hi Robert, I also encourage you to check out
>> https://github.com/linkedin/TonY (TensorFlow on YARN) which is a
>> platform built for this purpose.
>>
>> Jonathan
>> ________________________________
>> From: Sunil G <su...@apache.org>
>> Sent: Tuesday, November 6, 2018 10:05:14 PM
>> To: Robert Grandl
>> Cc: yarn-dev@hadoop.apache.org; yarn-dev-help@hadoop.apache.org; General
>> Subject: Re: Run Distributed TensorFlow on YARN
>>
>> Hi Robert
>>
>> {Submarine} project helps to run Distributed Tensorflow on top of YARN
>> with
>> ease. YARN-8220 <https://issues.apache.org/jira/browse/YARN-8220> was an
>> early attempt to do the same with some scripts etc, but Submarine will
>> help
>> to avoid all such custom scripts etc, and rather can simply run tensorflow
>> like a distributed shell command line by using Submarine jar. Pls refer
>> below doc for deep dive.
>>
>> https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7
>>
>> Submarine will be released as part of Hadoop 3.2.0 release which will be
>> out very soon officially (in coming weeks). you are free to use hadoop
>> trunk to run same if you need very soon.
>>
>> For now you can refer submarine docs under hadoop repo (trunk)
>> under
>> hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/
>> or(
>>
>> https://github.com/apache/hadoop/tree/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown
>> )
>>
>> Thanks
>> Sunil
>>
>>
>> On Wed, Nov 7, 2018 at 10:34 AM Robert Grandl <rg...@yahoo.com.invalid>
>> wrote:
>>
>> >  Hi all,
>> > I am wondering if there is any stable support to run distributed
>> > TensorFlow atop YARN at the moment.
>> > I found this blog post from Hortonworks. It seems this it is possible
>> > starting YARN 3.1.0.
>> >
>> https://hortonworks.com/blog/distributed-tensorflow-assembly-hadoop-yarn/
>> >
>> >
>> > Also I found some more recent JIRAs:
>> > https://issues.apache.org/jira/browse/YARN-8220
>> > https://issues.apache.org/jira/browse/YARN-8135
>> > which suggests to use something called submarine.
>> >
>> > However, I could not find any proper documentation or instructions to
>> use
>> > any of these.
>> >
>> > Can someone help me with this?
>> > Otherwise, it is any better support to run any other machine learning
>> > framework with YARN?
>> > Thank you in advance,- Robert
>> >
>
>

Re: Run Distributed TensorFlow on YARN

Posted by Wangda Tan <wh...@gmail.com>.
Hi Robert,

Submarine in 3.2.0 only supports the Docker container runtime; in future
releases (maybe 3.2.1), we plan to add support for non-Docker containers.

In order to try Submarine, you need to properly configure docker-on-yarn
first.

You can check
https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/InstallationScriptEN.md
for an installation guide on how to properly set up Docker containers on
multiple containers. Submarine embeds an interactive shell to help you with
the setup, so this should be straightforward. I have added Xun Liu, who is the
original author of the interactive installation shell.

Once you get Docker on YARN properly set up, you can follow
https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/QuickStart.md
to run the first application.
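
Before moving on to the QuickStart, a quick sanity check of the Docker-on-YARN configuration can save a failed submission. The following is only a minimal sketch, not part of Submarine: the function name docker_runtime_allowed is hypothetical, and it assumes that the NodeManager's yarn-site.xml enables Docker through the yarn.nodemanager.runtime.linux.allowed-runtimes property and that the <value> element follows the <name> element within two lines.

```bash
# Sketch: succeeds (exit 0) if the given yarn-site.xml lists "docker" in
# yarn.nodemanager.runtime.linux.allowed-runtimes, fails (exit 1) otherwise.
# Assumption: <value> sits on the same line as <name> or within the next 2 lines.
# This is a text match, not a real XML parse.
docker_runtime_allowed() {
  local site_xml="$1"
  grep -A 2 'yarn.nodemanager.runtime.linux.allowed-runtimes' "$site_xml" 2>/dev/null \
    | grep -E '<value>[^<]*docker[^<]*</value>' > /dev/null
}

# Example use (HADOOP_CONF_DIR must point at the NodeManager configuration):
#   docker_runtime_allowed "$HADOOP_CONF_DIR/yarn-site.xml" \
#     && echo "docker runtime is allowed" \
#     || echo "docker runtime is NOT listed in allowed-runtimes"
```

A passing check only shows that the property is present; the installation guide linked above remains the reference for the full setup.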

Also, you can check Submarine slides to better understand how it works.
See: https://www.dropbox.com/s/wuv19b3rt9k2kq6/submarine-v0.pptx?dl=0

If you have any questions, please don't hesitate to let us know.

Thanks,
Wangda



On Thu, Nov 8, 2018 at 10:12 AM Robert Grandl <rg...@yahoo.com.invalid>
wrote:

>  Thanks a lot for your reply.
> Sunil,
> I was trying to follow the steps from:
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/RunningDistributedCifar10TFJobs.md
>
> to run the tensorflow standalone using submarine. I have installed hadoop
> 3.3.0-SNAPSHOT.
> However, when I run the:yarn jar
> path/to/hadoop-yarn-applications-submarine-3.2.0-SNAPSHOT.jar \
>    job run --name tf-job-001 --verbose --docker_image
> hadoopsubmarine/tf-1.8.0-gpu:0.0.1 \
>    --input_path hdfs://default/dataset/cifar-10-data \
>    --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
>    --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0
>    --num_workers 1 --worker_resources memory=8G,vcores=2,gpu=1 \
>    --worker_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator
> && python cifar10_main.py --data-dir=%input_path%
> --job-dir=%checkpoint_path% --train-steps=10000 --eval-batch-size=16
> --train-batch-size=16 --num-gpus=2 --sync" \
>    --tensorboard --tensorboard_docker_image wtan/tf-1.8.0-cpu:0.0.3
> command, I get the following error:2018-11-07 21:48:55,831 INFO  [main]
> client.AHSProxy (AHSProxy.java:createAHSProxy(42)) - Connecting to
> Application History server at /128.105.144.236:10200Exception in thread
> "main" java.lang.IllegalArgumentException: Unacceptable no of cpus
> specified, either zero or negative for component master (or at the global
> level)        at
> org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateServiceResource(ServiceApiUtil.java:457)
>       at
> org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateComponent(ServiceApiUtil.java:306)
>       at
> org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateAndResolveService(ServiceApiUtil.java:237)
>       at
> org.apache.hadoop.yarn.service.client.ServiceClient.actionCreate(ServiceClient.java:496)
>       at
> org.apache.hadoop.yarn.submarine.runtimes.yarnservice.YarnServiceJobSubmitter.submitJob(YarnServiceJobSubmitter.java:542)
>       at
> org.apache.hadoop.yarn.submarine.client.cli.RunJobCli.run(RunJobCli.java:231)
>       at org.apache.hadoop.yarn.submarine.client.cli.Cli.main(Cli.java:94)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>          at java.lang.reflect.Method.invoke(Method.java:498)        at
> org.apache.hadoop.util.RunJar.run(RunJar.java:323)        at
> org.apache.hadoop.util.RunJar.main(RunJar.java:236)
>
> It seems that I don't configure somewhere some corresponding resources for
> a master component. However I have a hard time understanding where and what
> to configure. I also looked at the design document you pointed at:
> https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7
>
> and it has a --master_resources flag. However this is not available in
> 3.3.0.
> Could you please advise how to proceed with this?
> Thank you,- Robert
>
>     On Tuesday, November 6, 2018, 10:40:20 PM PST, Jonathan Hung <
> jyhung2357@gmail.com> wrote:
>
>  Hi Robert, I also encourage you to check out
> https://github.com/linkedin/TonY (TensorFlow on YARN) which is a platform
> built for this purpose.
>
> Jonathan
> ________________________________
> From: Sunil G <su...@apache.org>
> Sent: Tuesday, November 6, 2018 10:05:14 PM
> To: Robert Grandl
> Cc: yarn-dev@hadoop.apache.org; yarn-dev-help@hadoop.apache.org; General
> Subject: Re: Run Distributed TensorFlow on YARN
>
> Hi Robert
>
> {Submarine} project helps to run Distributed Tensorflow on top of YARN with
> ease. YARN-8220 <https://issues.apache.org/jira/browse/YARN-8220> was an
> early attempt to do the same with some scripts etc, but Submarine will help
> to avoid all such custom scripts etc, and rather can simply run tensorflow
> like a distributed shell command line by using Submarine jar. Pls refer
> below doc for deep dive.
>
> https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7
>
> Submarine will be released as part of Hadoop 3.2.0 release which will be
> out very soon officially (in coming weeks). you are free to use hadoop
> trunk to run same if you need very soon.
>
> For now you can refer submarine docs under hadoop repo (trunk)
> under
> hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/
> or(
>
> https://github.com/apache/hadoop/tree/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown
> )
>
> Thanks
> Sunil
>
>
> On Wed, Nov 7, 2018 at 10:34 AM Robert Grandl <rg...@yahoo.com.invalid>
> wrote:
>
> >  Hi all,
> > I am wondering if there is any stable support to run distributed
> > TensorFlow atop YARN at the moment.
> > I found this blog post from Hortonworks. It seems this it is possible
> > starting YARN 3.1.0.
> >
> https://hortonworks.com/blog/distributed-tensorflow-assembly-hadoop-yarn/
> >
> >
> > Also I found some more recent JIRAs:
> > https://issues.apache.org/jira/browse/YARN-8220
> > https://issues.apache.org/jira/browse/YARN-8135
> > which suggests to use something called submarine.
> >
> > However, I could not find any proper documentation or instructions to use
> > any of these.
> >
> > Can someone help me with this?
> > Otherwise, it is any better support to run any other machine learning
> > framework with YARN?
> > Thank you in advance,- Robert
> >

Re: Run Distributed TensorFlow on YARN

Posted by Robert Grandl <rg...@yahoo.com.INVALID>.
 Thanks a lot for your reply. 
Sunil,
I was trying to follow the steps from: https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/RunningDistributedCifar10TFJobs.md

to run the tensorflow standalone using submarine. I have installed hadoop 3.3.0-SNAPSHOT. 
However, when I run the following command:

yarn jar path/to/hadoop-yarn-applications-submarine-3.2.0-SNAPSHOT.jar \
   job run --name tf-job-001 --verbose --docker_image hadoopsubmarine/tf-1.8.0-gpu:0.0.1 \
   --input_path hdfs://default/dataset/cifar-10-data \
   --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
   --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 \
   --num_workers 1 --worker_resources memory=8G,vcores=2,gpu=1 \
   --worker_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=10000 --eval-batch-size=16 --train-batch-size=16 --num-gpus=2 --sync" \
   --tensorboard --tensorboard_docker_image wtan/tf-1.8.0-cpu:0.0.3

I get the following error:

2018-11-07 21:48:55,831 INFO  [main] client.AHSProxy (AHSProxy.java:createAHSProxy(42)) - Connecting to Application History server at /128.105.144.236:10200
Exception in thread "main" java.lang.IllegalArgumentException: Unacceptable no of cpus specified, either zero or negative for component master (or at the global level)
        at org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateServiceResource(ServiceApiUtil.java:457)
        at org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateComponent(ServiceApiUtil.java:306)
        at org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateAndResolveService(ServiceApiUtil.java:237)
        at org.apache.hadoop.yarn.service.client.ServiceClient.actionCreate(ServiceClient.java:496)
        at org.apache.hadoop.yarn.submarine.runtimes.yarnservice.YarnServiceJobSubmitter.submitJob(YarnServiceJobSubmitter.java:542)
        at org.apache.hadoop.yarn.submarine.client.cli.RunJobCli.run(RunJobCli.java:231)
        at org.apache.hadoop.yarn.submarine.client.cli.Cli.main(Cli.java:94)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:236)

It seems that I am missing some resource configuration for a master component. However, I have a hard time understanding where and what to configure. I also looked at the design document you pointed to:
https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7

and it has a --master_resources flag. However this is not available in 3.3.0.
Could you please advise how to proceed with this?
Thank you,- Robert
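
For readers following the trace above: the exception text suggests that the service validation rejects a component whose CPU count is zero or negative, after also considering a global setting. The following is only a minimal Python sketch of that kind of check, written from the wording of the error message. It is not the Hadoop code, and the function and parameter names are hypothetical.

```python
# Illustrative sketch only; NOT the actual ServiceApiUtil logic.
# All names here (validate_component_cpus, component_cpus, global_cpus)
# are hypothetical and chosen for this example.

def validate_component_cpus(component, component_cpus=None, global_cpus=None):
    """Return the effective CPU count for a component or raise ValueError.

    Assumption: a component-level value, when present, takes precedence
    over the global value.
    """
    cpus = component_cpus if component_cpus is not None else global_cpus
    if cpus is None or cpus <= 0:
        raise ValueError(
            "Unacceptable no of cpus specified, either zero or negative "
            f"for component {component} (or at the global level)"
        )
    return cpus


if __name__ == "__main__":
    # A component with an explicit positive CPU count passes.
    print(validate_component_cpus("worker", component_cpus=2))
    # A component with no CPU count at either level is rejected.
    try:
        validate_component_cpus("master")
    except ValueError as err:
        print(err)
```

Under this reading, the error would mean that no positive CPU count reached the component named master, either directly or through a global default.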

    On Tuesday, November 6, 2018, 10:40:20 PM PST, Jonathan Hung <jy...@gmail.com> wrote:  
 
 Hi Robert, I also encourage you to check out https://github.com/linkedin/TonY (TensorFlow on YARN) which is a platform built for this purpose.

Jonathan
________________________________
From: Sunil G <su...@apache.org>
Sent: Tuesday, November 6, 2018 10:05:14 PM
To: Robert Grandl
Cc: yarn-dev@hadoop.apache.org; yarn-dev-help@hadoop.apache.org; General
Subject: Re: Run Distributed TensorFlow on YARN

Hi Robert

The Submarine project helps to run distributed TensorFlow on top of YARN
with ease. YARN-8220 <https://issues.apache.org/jira/browse/YARN-8220> was
an early attempt to do the same with some scripts, but Submarine avoids
all such custom scripts: you can simply run TensorFlow like a distributed
shell command line by using the Submarine jar. Please refer to the doc
below for a deep dive.
https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7

Submarine will be released as part of the Hadoop 3.2.0 release, which will
be out officially very soon (in the coming weeks). You are free to use
Hadoop trunk to run it if you need it sooner.

For now you can refer to the Submarine docs in the Hadoop repo (trunk)
under hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/
or (
https://github.com/apache/hadoop/tree/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown
)

Thanks
Sunil


On Wed, Nov 7, 2018 at 10:34 AM Robert Grandl <rg...@yahoo.com.invalid>
wrote:

>  Hi all,
> I am wondering if there is any stable support to run distributed
> TensorFlow atop YARN at the moment.
> I found this blog post from Hortonworks. It seems this is possible
> starting with YARN 3.1.0.
> https://hortonworks.com/blog/distributed-tensorflow-assembly-hadoop-yarn/
>
>
> Also I found some more recent JIRAs:
> https://issues.apache.org/jira/browse/YARN-8220
> https://issues.apache.org/jira/browse/YARN-8135
> which suggest using something called Submarine.
>
> However, I could not find any proper documentation or instructions to use
> any of these.
>
> Can someone help me with this?
> Otherwise, is there any better support to run any other machine learning
> framework with YARN?
> Thank you in advance,
> - Robert
>  

Re: Run Distributed TensorFlow on YARN

Posted by Jonathan Hung <jy...@gmail.com>.
Hi Robert, I also encourage you to check out https://github.com/linkedin/TonY (TensorFlow on YARN) which is a platform built for this purpose.

Jonathan
________________________________
From: Sunil G <su...@apache.org>
Sent: Tuesday, November 6, 2018 10:05:14 PM
To: Robert Grandl
Cc: yarn-dev@hadoop.apache.org; yarn-dev-help@hadoop.apache.org; General
Subject: Re: Run Distributed TensorFlow on YARN

Hi Robert

The Submarine project helps to run distributed TensorFlow on top of YARN
with ease. YARN-8220 <https://issues.apache.org/jira/browse/YARN-8220> was
an early attempt to do the same with some scripts, but Submarine avoids
all such custom scripts: you can simply run TensorFlow like a distributed
shell command line by using the Submarine jar. Please refer to the doc
below for a deep dive.
https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7

Submarine will be released as part of the Hadoop 3.2.0 release, which will
be out officially very soon (in the coming weeks). You are free to use
Hadoop trunk to run it if you need it sooner.

For now you can refer to the Submarine docs in the Hadoop repo (trunk)
under hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/
or (
https://github.com/apache/hadoop/tree/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown
)

Thanks
Sunil


On Wed, Nov 7, 2018 at 10:34 AM Robert Grandl <rg...@yahoo.com.invalid>
wrote:

>  Hi all,
> I am wondering if there is any stable support to run distributed
> TensorFlow atop YARN at the moment.
> I found this blog post from Hortonworks. It seems this is possible
> starting with YARN 3.1.0.
> https://hortonworks.com/blog/distributed-tensorflow-assembly-hadoop-yarn/
>
>
> Also I found some more recent JIRAs:
> https://issues.apache.org/jira/browse/YARN-8220
> https://issues.apache.org/jira/browse/YARN-8135
> which suggest using something called Submarine.
>
> However, I could not find any proper documentation or instructions to use
> any of these.
>
> Can someone help me with this?
> Otherwise, is there any better support to run any other machine learning
> framework with YARN?
> Thank you in advance,
> - Robert
>

Re: Run Distributed TensorFlow on YARN

Posted by Sunil G <su...@apache.org>.
Hi Robert

The Submarine project helps to run distributed TensorFlow on top of YARN
with ease. YARN-8220 <https://issues.apache.org/jira/browse/YARN-8220> was
an early attempt to do the same with some scripts, but Submarine avoids
all such custom scripts: you can simply run TensorFlow like a distributed
shell command line by using the Submarine jar. Please refer to the doc
below for a deep dive.
https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7

Submarine will be released as part of the Hadoop 3.2.0 release, which will
be out officially very soon (in the coming weeks). You are free to use
Hadoop trunk to run it if you need it sooner.

For now you can refer to the Submarine docs in the Hadoop repo (trunk)
under hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/
or (
https://github.com/apache/hadoop/tree/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown
)

Thanks
Sunil


On Wed, Nov 7, 2018 at 10:34 AM Robert Grandl <rg...@yahoo.com.invalid>
wrote:

>  Hi all,
> I am wondering if there is any stable support to run distributed
> TensorFlow atop YARN at the moment.
> I found this blog post from Hortonworks. It seems this is possible
> starting with YARN 3.1.0.
> https://hortonworks.com/blog/distributed-tensorflow-assembly-hadoop-yarn/
>
>
> Also I found some more recent JIRAs:
> https://issues.apache.org/jira/browse/YARN-8220
> https://issues.apache.org/jira/browse/YARN-8135
> which suggest using something called Submarine.
>
> However, I could not find any proper documentation or instructions to use
> any of these.
>
> Can someone help me with this?
> Otherwise, is there any better support to run any other machine learning
> framework with YARN?
> Thank you in advance,
> - Robert
>
