Posted to user@hadoop.apache.org by Vinay Kashyap <vi...@gmail.com> on 2019/02/22 07:34:37 UTC

Hadoop 3.2.0 {Submarine} : Understanding HDFS data Read/Write during/after application launch/execution

Hi all,

I am using Hadoop 3.2.0 and trying a few examples that use Submarine to run
TensorFlow jobs in a Docker container.
I would like to understand a few details regarding reading/writing HDFS data
during/after application launch/execution. I have highlighted the questions
inline.

When launching an application that reads input from HDFS, we configure
*--input_path* to an HDFS path, as shown in the standard example.

yarn jar hadoop-yarn-applications-submarine-<version>.jar job run \
 --name tf-job-001 --docker_image <your docker image> \
 --input_path hdfs://default/dataset/cifar-10-data \
 --checkpoint_path hdfs://default/tmp/cifar-10-jobdir \
 --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
 --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 \
 --num_workers 2 \
 --worker_resources memory=8G,vcores=2,gpu=1 --worker_launch_cmd "cmd for
worker ..." \
 --num_ps 2 \
 --ps_resources memory=4G,vcores=2,gpu=0 --ps_launch_cmd "cmd for ps" \

*Question 1 : What if I have more than one dataset, each in a separate HDFS
path? Can --input_path take multiple paths in any fashion, or is it expected
that all the datasets be maintained under one path?*

"DOCKER_JAVA_HOME points to JAVA_HOME inside Docker image"
and "DOCKER_HADOOP_HDFS_HOME points to HADOOP_HDFS_HOME inside Docker
image".

*Question 2 : What is the exact expectation here? That is, is there any
relation/connection with the Hadoop installation running outside the Docker
container? I guess reading HDFS data into the Docker container happens during
container localization, but how does the output data get written back to the
HDFS running outside the Docker container?*

Assume a scenario where Application 1 creates a model and Application 2
performs scoring, and both applications run in separate Docker containers.
I would like to understand how the data reads and writes across the two
applications happen in this case.
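
To make the scenario concrete, my assumption (illustrative commands only; the
job names and the model path below are hypothetical, and "..." stands for the
remaining flags from the example above) is that Application 2's --input_path
would simply point at the HDFS location where Application 1 wrote its model
via --checkpoint_path:

# Application 1: training job writes its model/checkpoints to HDFS
yarn jar hadoop-yarn-applications-submarine-<version>.jar job run \
 --name train-001 ... \
 --checkpoint_path hdfs://default/models/my-model

# Application 2: scoring job reads that model back from the same HDFS path
yarn jar hadoop-yarn-applications-submarine-<version>.jar job run \
 --name score-001 ... \
 --input_path hdfs://default/models/my-model
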
It would be of great help if anyone could guide me in understanding this or
direct me to a blog or write-up which explains the above.

*Thanks and regards*
*Vinay Kashyap*

Re: Hadoop 3.2.0 {Submarine} : Understanding HDFS data Read/Write during/after application launch/execution

Posted by Vinay Kashyap <vi...@gmail.com>.
Thanks, Zhankun, for the clarification.
Also, is my understanding of --checkpoint_path correct, as I mentioned
earlier in the thread? Quoting the comment again here:

[There is another argument called *--checkpoint_path* which acts as the path
where all the outputs (models or datasets) produced by the execution of the
worker code inside the Docker container are written. Hence, *--input_path*
acts as the entry point which will be localized and *--checkpoint_path* acts
as the exit point, where both of these paths are HDFS paths that live outside
the Docker container.]

I will continue my exercises with Submarine and would love to discuss more.




-- 
*Thanks and regards*
*Vinay Kashyap*

Re: Hadoop 3.2.0 {Submarine} : Understanding HDFS data Read/Write during/after application launch/execution

Posted by zhankun tang <ta...@gmail.com>.
Hi Vinay,

IIRC, YARN sets the host's Hadoop environment variables in the container
launch script by default. In the Submarine case, the user's worker command is
used to generate a worker script which is invoked from the container launch
script. If Submarine doesn't override the default Hadoop environment
variables, HDFS reads/writes in the container might fail due to a missing or
incorrect Hadoop location.
So even if a Docker image is built with the correct Hadoop environment set,
it still seems to need this override in order to use the HDFS libraries in a
container. This appears to be caused by YARN's Docker support, and Submarine
is working around it here.
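
To illustrate the idea (a rough sketch only, not the exact script Submarine
generates), the override in the generated worker launch script is roughly
equivalent to re-exporting the values passed via --env:

# Sketch only: re-point the container at its own Java/Hadoop installations,
# overriding the host paths YARN injected into the launch script.
export JAVA_HOME=${DOCKER_JAVA_HOME}
export HADOOP_HDFS_HOME=${DOCKER_HADOOP_HDFS_HOME}
export PATH=${HADOOP_HDFS_HOME}/bin:${JAVA_HOME}/bin:${PATH}

# With a working client on PATH, HDFS access from inside the container works:
hdfs dfs -ls hdfs://default/dataset/cifar-10-data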

Submarine is evolving rapidly, so please share your thoughts if anything is
inconvenient for you.

Thanks,
Zhankun


Re: Hadoop 3.2.0 {Submarine} : Understanding HDFS data Read/Write during/after application launch/execution

Posted by Vinay Kashyap <vi...@gmail.com>.
Hi Zhankun,
Thanks for the reply.

Regarding Question 1 : Okay, I understand. Let me try configuring multiple
input path placeholders and referring to them in the worker launch command.

Regarding Question 2 :
What I did not understand is why YARN has to set anything related to the
Hadoop that runs inside the container. The Hadoop environment and the worker
code that reads it are completely isolated to the Docker container. In that
case, the worker scripts should know where HADOOP_HOME is inside the
container, right? There is another argument called *--checkpoint_path* which
acts as the path where all the outputs (models or datasets) produced by the
execution of the worker code inside the Docker container are written. Hence,
*--input_path* acts as the entry point which will be localized and
*--checkpoint_path* acts as the exit point, where both of these paths are
HDFS paths that live outside the Docker container. So why should YARN know
about the Hadoop configuration inside the container?
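
To make my mental model concrete (an illustrative worker launch command only;
cifar10_train.py and its flags are hypothetical):

# Illustration: %input_path% is replaced by Submarine with the --input_path
# value; the model/checkpoints are written straight back to HDFS.
python cifar10_train.py \
    --train-data %input_path% \
    --model-dir hdfs://default/tmp/cifar-10-jobdir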

Thanks and regards
Vinay Kashyap


-- 
*Thanks and regards*
*Vinay Kashyap*

Re: Hadoop 3.2.0 {Submarine} : Understanding HDFS data Read/Write during/after application launch/execution

Posted by zhankun tang <ta...@gmail.com>.
Hi Vinay,

For question one, IIRC, we cannot set the "*--input_path*" flag multiple
times at present. "--input_path" was originally designed as a placeholder
that stores a path; that path is then used to replace "%input_path%" in the
worker command, like "python worker.sh -input %input_path% ..".
So from this perspective, you can directly append the other input paths to
your worker command in your own way.
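
For example (a sketch only; worker.sh, the flag names, and the second dataset
path are placeholders), the extra path can just be passed as another literal
argument inside --worker_launch_cmd:

# Sketch: %input_path% is substituted by Submarine with the --input_path
# value; the second dataset is passed explicitly as a literal HDFS path.
--worker_launch_cmd "python worker.sh -input %input_path% \
    -extra_input hdfs://default/dataset/second-dataset"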

For question two, YARN might set a wrong HADOOP_COMMON_HOME by default, so
Submarine provides the environment variables to be set in the worker's
launch script if the worker wants to access HDFS.
And there is no data-plane relation between the outside Hadoop and the
container, except that YARN localizes resources for the container.
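
In other words, once the environment points at a working HDFS client inside
the container, the worker talks to the external HDFS directly over the
network; there is no separate copy path besides localization. A minimal
sketch (the local model.ckpt file name is hypothetical):

# Sketch: read input and write output directly against the cluster's HDFS
# from inside the container, using the in-container HDFS client.
hdfs dfs -ls hdfs://default/dataset/cifar-10-data
hdfs dfs -put ./model.ckpt hdfs://default/tmp/cifar-10-jobdir/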

Hope this can answer your questions.

Best Regards,
Zhankun
