Posted to user@hadoop.apache.org by John Lilley <jo...@redpoint.net> on 2013/10/20 17:24:19 UTC

temporary file locations for YARN applications

We have a pure YARN application (no MapReduce) that needs to store a significant amount of temporary data.  How can we know the best location for these files?  How can we ensure that our YARN tasks have write access to these locations?  Is this something that must be configured outside of YARN?
Thanks,
John



RE: temporary file locations for YARN applications

Posted by John Lilley <jo...@redpoint.net>.
By the way, this turned out to be the hard-coded environment variable name "LOCAL_DIRS" rather than ApplicationConstants.LOCAL_DIR_ENV, which we can't find defined anywhere.
John


-----Original Message-----
From: Harsh J [mailto:harsh@cloudera.com] 
Sent: Monday, October 21, 2013 12:11 PM
To: <us...@hadoop.apache.org>
Subject: Re: temporary file locations for YARN applications

The dirs in that env-var are app-specific and are for the app's user to utilize. You shouldn't have any permission issues working within them.

The LocalDirAllocator is still somewhat MR-bound, but you should still be able to make it work by giving it a config with the values it needs.

On Mon, Oct 21, 2013 at 8:49 PM, John Lilley <jo...@redpoint.net> wrote:
> Thanks again.  This gives me a lot of options; we will see what works.
>
> Do you know if there are any permissions issues if we directly access the folders of LOCAL_DIR_ENV?
>
> Regarding LocalDirAllocator, I see its constructor: LocalDirAllocator(String contextCfgItemName) and a note mentioning that an example of this item is "mapred.local.dir".  Is that the correct usage, or is there something YARN-generic?
>
> Cheers,
> john
>
> -----Original Message-----
> From: Harsh J [mailto:harsh@cloudera.com]
> Sent: Sunday, October 20, 2013 11:58 PM
> To: <us...@hadoop.apache.org>
> Subject: Re: temporary file locations for YARN applications
>
> Hi,
>
> MR does use multiple disks when spilling. But the work directory is also round-robined to spread I/O.
>
> YARN sets an environment variable that's a comma-separated list of
> directories (ApplicationConstants.LOCAL_DIR_ENV) that your app container
> can use together. Perhaps read it in with
> StringUtils.getTrimmedStrings(System.getenv(ApplicationConstants.LOCAL_DIR_ENV));
> and then round-robin internally over those paths (with free-space
> handling)?
>
> Perhaps you can even reuse the org.apache.hadoop.fs.LocalDirAllocator
> class, which is what MR uses. It has not been declared publicly stable, though we can do that via a JIRA.
>
> On Mon, Oct 21, 2013 at 2:05 AM, John Lilley <jo...@redpoint.net> wrote:
>> Harsh, thanks for the quick response.  These files don't need to be on the DFS (although we use that too).  These are local files used during sorting, joining, and transitive closure.
>>
>> The task-relative folder might be good enough, but our app *can* make use of multiple temp folders if they are available.  Our YARN app can be fairly I/O intensive; is it possible to allocate more than one temp folder on different physical devices?
>>
>> Or perhaps YARN might help us. Will YARN assign tasks to CWD folders on different disks so that they do not compete with each other on I/O?
>>
>> For that matter, where does MR allocate the temporary files generated by Mapper output?  Presumably MR has the same I/O parallelism requirements that we do.
>>
>> Thanks
>> John
>>
>>
>> -----Original Message-----
>> From: Harsh J [mailto:harsh@cloudera.com]
>> Sent: Sunday, October 20, 2013 10:49 AM
>> To: <us...@hadoop.apache.org>
>> Subject: Re: temporary file locations for YARN applications
>>
>> Every container gets its own local work directory (you can use the relative ./) that's automatically cleaned up at the end of the container's life.
>> This is the best place to store the temporary files. This is not something you need custom configuration for.
>>
>> Do the files need to be on a distributed FS or a local one?
>>
>> On Sun, Oct 20, 2013 at 8:54 PM, John Lilley <jo...@redpoint.net> wrote:
>>> We have a pure YARN application (no MapReduce) that needs to
>>> store a significant amount of temporary data.  How can we know the
>>> best location for these files?  How can we ensure that our YARN
>>> tasks have write access to these locations?  Is this something that must be configured outside of YARN?
>>> Thanks,
>>> John
>>
>> --
>> Harsh J
>
>
>
> --
> Harsh J



--
Harsh J
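
The round-robin idea suggested in this thread can be sketched roughly as follows, using only the JDK. This is an illustration under assumptions, not Hadoop's own mechanism: the class name LocalDirRoundRobin and its methods are hypothetical, the variable name "LOCAL_DIRS" is the one reported above, and the fallback to java.io.tmpdir exists only so the sketch also runs outside a container.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper (not a Hadoop class): rotates over a comma-separated list of local dirs. */
final class LocalDirRoundRobin {
    private final List<File> dirs = new ArrayList<>();
    private final AtomicInteger next = new AtomicInteger();

    LocalDirRoundRobin(String commaSeparatedDirs) {
        if (commaSeparatedDirs != null) {
            for (String entry : commaSeparatedDirs.split(",")) {
                String trimmed = entry.trim();          // tolerate spaces around commas
                if (!trimmed.isEmpty()) {
                    dirs.add(new File(trimmed));
                }
            }
        }
        if (dirs.isEmpty()) {
            throw new IllegalArgumentException("no local directories given");
        }
    }

    /** Returns the next directory in rotation that is writable and has enough usable space. */
    File nextDir(long minFreeBytes) throws IOException {
        for (int tries = 0; tries < dirs.size(); tries++) {
            // floorMod keeps the index valid even after the counter overflows
            File dir = dirs.get(Math.floorMod(next.getAndIncrement(), dirs.size()));
            if (dir.isDirectory() && dir.canWrite() && dir.getUsableSpace() >= minFreeBytes) {
                return dir;
            }
        }
        throw new IOException("no local directory with " + minFreeBytes + " bytes free");
    }

    /** Creates a temp file in the next suitable directory. */
    File createTempFile(String prefix, long expectedBytes) throws IOException {
        return File.createTempFile(prefix, ".tmp", nextDir(expectedBytes));
    }

    public static void main(String[] args) throws IOException {
        // Assumption: "LOCAL_DIRS" is set inside the container; fall back when it is not.
        String dirs = System.getenv("LOCAL_DIRS");
        if (dirs == null) {
            dirs = System.getProperty("java.io.tmpdir");
        }
        LocalDirRoundRobin allocator = new LocalDirRoundRobin(dirs);
        File scratch = allocator.createTempFile("sort-run-", 1024);
        System.out.println("scratch file: " + scratch);
        scratch.deleteOnExit();
    }
}
```

The free-space check is only a snapshot taken at allocation time; a file that grows well past expectedBytes can still fill the disk, so callers would need to handle write failures themselves.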

RE: temporary file locations for YARN applications

Posted by John Lilley <jo...@redpoint.net>.
Thanks, sounds like LOCAL_DIR_ENV is the way to go.
john
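
For comparison, the simpler option mentioned earlier in the thread (the container's own working directory, which is cleaned up when the container ends) needs no environment lookup at all. A minimal JDK-only sketch, where the class name WorkDirScratch and the helper createScratchFile are hypothetical names chosen for illustration:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: scratch files in the process working directory. */
final class WorkDirScratch {
    /** Creates an empty, uniquely named scratch file inside workDir. */
    static Path createScratchFile(Path workDir, String prefix) throws IOException {
        return Files.createTempFile(workDir, prefix, ".tmp");
    }

    public static void main(String[] args) throws IOException {
        // Inside a YARN container the current directory is the container's work directory.
        Path cwd = Paths.get("").toAbsolutePath();
        Path scratch = createScratchFile(cwd, "join-");
        Files.write(scratch, "intermediate data".getBytes(StandardCharsets.UTF_8));
        System.out.println(scratch + ": " + Files.size(scratch) + " bytes");
        Files.delete(scratch); // optional here, since the container directory is cleaned up anyway
    }
}
```

This keeps all temporary data on whichever disk hosts the work directory, so it does not spread I/O across devices the way the directory list does.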


RE: temporary file locations for YARN applications

Posted by John Lilley <jo...@redpoint.net>.
Right, that's very useful for ensuring that copies of read-only data are available to all nodes.  We do use LocalResources for the transport of our executable environment to the nodes.
Cheers,
John


From: Jian He [mailto:jhe@hortonworks.com]
Sent: Monday, October 21, 2013 12:22 PM
To: user@hadoop.apache.org
Subject: Re: temporary file locations for YARN applications

This post might help a bit.
http://hortonworks.com/blog/management-of-application-dependencies-in-yarn/

Thanks,
Jian




Re: temporary file locations for YARN applications

Posted by Jian He <jh...@hortonworks.com>.
This post might help a bit.
http://hortonworks.com/blog/management-of-application-dependencies-in-yarn/

Thanks,
Jian


On Mon, Oct 21, 2013 at 11:11 AM, Harsh J <ha...@cloudera.com> wrote:

> The dirs in that env-var are app-specific and are for the app's user
> to utilize. You shouldn't have any permission issues working within
> them.
>
> The LocalDirAllocator is still somewhat MR-bound but you can still be
> able to make it work by giving it a config with the values it needs.
>
> On Mon, Oct 21, 2013 at 8:49 PM, John Lilley <jo...@redpoint.net> wrote:
> > Thanks again.  This gives me a lot of options; we will see what works.
> >
> > Do you know if there are any permissions issues if we directly access
> > the folders of LOCAL_DIR_ENV?
> >
> > Regarding LocalDirAllocator, I see its constructor:
> > LocalDirAllocator(String contextCfgItemName) and a note mentioning that an
> > example of this item is "mapred.local.dir".  Is that the correct usage, or
> > is there something YARN-generic?
> >
> > Cheers,
> > john
> >
> > -----Original Message-----
> > From: Harsh J [mailto:harsh@cloudera.com]
> > Sent: Sunday, October 20, 2013 11:58 PM
> > To: <us...@hadoop.apache.org>
> > Subject: Re: temporary file locations for YARN applications
> >
> > Hi,
> >
> > MR does use multiple disks when spilling. But the work directory is also
> > round-robined to spread I/O.
> >
> > YARN sets an environment property that's a list (comma-separated value)
> > of directories (ApplicationConstants.LOCAL_DIR_ENV) your app container can
> > together use. Perhaps read it in with
> > StringUtils.getTrimmedStrings(System.getenv(ApplicationConstants.LOCAL_DIR_ENV));
> > and then round-robin internally over those paths (with free space
> > handling)?
> >
> > Perhaps you can even reuse the org.apache.hadoop.fs.LocalDirAllocator
> > class, which is what MR uses. It's not been declared publicly stable
> > though, but we can do that over a JIRA.
> >
> > On Mon, Oct 21, 2013 at 2:05 AM, John Lilley <jo...@redpoint.net> wrote:
> >> Harsh, thanks for the quick response.  These files don't need to be on
> >> the DFS (although we use that too).  These are local files used during
> >> sorting, joining, transitive closure.
> >>
> >> The task-relative folder might be good enough, but our app *can* make
> >> use of multiple temp folders if they are available.  Our YARN app can be
> >> fairly I/O intensive; is it possible to allocate more than one temp folder
> >> on different physical devices?
> >>
> >> Or perhaps YARN might help us. Will YARN assign tasks to CWD folders on
> >> different disks so that they do not compete with each other on I/O?
> >>
> >> For that matter, where does MR allocate the temporary files generated
> >> by Mapper output?  Presumably MR has the same I/O parallelism requirements
> >> that we do.
> >>
> >> Thanks
> >> John
> >>
> >>
> >> -----Original Message-----
> >> From: Harsh J [mailto:harsh@cloudera.com]
> >> Sent: Sunday, October 20, 2013 10:49 AM
> >> To: <us...@hadoop.apache.org>
> >> Subject: Re: temporary file locations for YARN applications
> >>
> >> Every container gets its own local work directory (you can use the
> >> relative ./) that's auto-cleaned up at the end of the container's life.
> >> This is the best place to store the temporary files. This is not
> >> something you need custom configuration for.
> >>
> >> Do the files need to be on a distributed FS or a local one?
> >>
> >> On Sun, Oct 20, 2013 at 8:54 PM, John Lilley <jo...@redpoint.net> wrote:
> >>> We have a pure YARN application (no MapReduce) that has need to store
> >>> a significant amount of temporary data.  How can we know the best
> >>> location for these files?  How can we ensure that our YARN tasks have
> >>> write access to these locations?  Is this something that must be
> >>> configured outside of YARN?
> >>> Thanks,
> >>> John
> >>
> >> --
> >> Harsh J
> >
> >
> >
> > --
> > Harsh J
>
>
>
> --
> Harsh J
>


RE: temporary file locations for YARN applications

Posted by John Lilley <jo...@redpoint.net>.
By the way, it seems that this ended up being a hard-coded environment variable name "LOCAL_DIRS" instead of ApplicationConstants.LOCAL_DIR_ENV, which we can't see defined anywhere.
John
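
For anyone following along, here is a minimal sketch of the workaround described above: read the variable by its literal name "LOCAL_DIRS" and split it on commas. The class and method names are hypothetical (they are not Hadoop APIs), and the fallback to the container working directory is an assumption based on the earlier advice about using ./. It uses only the JDK instead of Hadoop's StringUtils so that it compiles without Hadoop on the classpath. Later Hadoop 2.x releases appear to expose the same name as ApplicationConstants.Environment.LOCAL_DIRS, but please check that against the version you build with.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Hadoop): parses the container's local-dir list. */
final class ContainerLocalDirs {
    // Literal variable name reported in this thread.
    static final String LOCAL_DIRS_ENV = "LOCAL_DIRS";

    private ContainerLocalDirs() {}

    /** Splits a comma-separated list, trimming whitespace and dropping empty entries. */
    static List<Path> parse(String raw) {
        List<Path> dirs = new ArrayList<>();
        if (raw == null) {
            return dirs;
        }
        for (String part : raw.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                dirs.add(Paths.get(trimmed));
            }
        }
        return dirs;
    }

    /**
     * Reads LOCAL_DIRS from the environment. Assumption: if the variable is
     * unset or empty, fall back to the container working directory (".").
     */
    static List<Path> fromEnvironment() {
        List<Path> dirs = parse(System.getenv(LOCAL_DIRS_ENV));
        if (dirs.isEmpty()) {
            dirs.add(Paths.get("."));
        }
        return dirs;
    }
}
```

A container would call ContainerLocalDirs.fromEnvironment() once at startup and hand the resulting list to whatever component creates its temporary files.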


-----Original Message-----
From: Harsh J [mailto:harsh@cloudera.com] 
Sent: Monday, October 21, 2013 12:11 PM
To: <us...@hadoop.apache.org>
Subject: Re: temporary file locations for YARN applications

The dirs in that env-var are app-specific and are for the app's user to utilize. You shouldn't have any permission issues working within them.

The LocalDirAllocator is still somewhat MR-bound, but you should still be able to make it work by giving it a config with the values it needs.

On Mon, Oct 21, 2013 at 8:49 PM, John Lilley <jo...@redpoint.net> wrote:
> Thanks again.  This gives me a lot of options; we will see what works.
>
> Do you know if there are any permissions issues if we directly access the folders of LOCAL_DIR_ENV?
>
> Regarding LocalDirAllocator, I see its constructor: LocalDirAllocator(String contextCfgItemName) and a note mentioning that an example of this item is "mapred.local.dir".  Is that the correct usage, or is there something YARN-generic?
>
> Cheers,
> john
>
> -----Original Message-----
> From: Harsh J [mailto:harsh@cloudera.com]
> Sent: Sunday, October 20, 2013 11:58 PM
> To: <us...@hadoop.apache.org>
> Subject: Re: temporary file locations for YARN applications
>
> Hi,
>
> MR does use multiple disks when spilling. But the work directory is also round-robined to spread I/O.
>
> YARN sets an environment property that's a list (comma-separated value)
> of directories (ApplicationConstants.LOCAL_DIR_ENV) your app container
> can together use. Perhaps read it in with
> StringUtils.getTrimmedStrings(System.getenv(ApplicationConstants.LOCAL_DIR_ENV));
> and then round-robin internally over those paths (with free space handling)?
>
> Perhaps you can even reuse the org.apache.hadoop.fs.LocalDirAllocator
> class, which is what MR uses. It's not been declared publicly stable though, but we can do that over a JIRA.
>
> On Mon, Oct 21, 2013 at 2:05 AM, John Lilley <jo...@redpoint.net> wrote:
>> Harsh, thanks for the quick response.  These files don't need to be on the DFS (although we use that too).  These are local files used during sorting, joining, transitive closure.
>>
>> The task-relative folder might be good enough, but our app *can* make use of multiple temp folders if they are available.  Our YARN app can be fairly I/O intensive; is it possible to allocate more than one temp folder on different physical devices?
>>
>> Or perhaps YARN might help us. Will YARN assign tasks to CWD folders on different disks so that they do not compete with each other on I/O?
>>
>> For that matter, where does MR allocate the temporary files generated by Mapper output?  Presumably MR has the same I/O parallelism requirements that we do.
>>
>> Thanks
>> John
>>
>>
>> -----Original Message-----
>> From: Harsh J [mailto:harsh@cloudera.com]
>> Sent: Sunday, October 20, 2013 10:49 AM
>> To: <us...@hadoop.apache.org>
>> Subject: Re: temporary file locations for YARN applications
>>
>> Every container gets its own local work directory (you can use the relative ./) that's auto-cleaned up at the end of the container's life.
>> This is the best place to store the temporary files. This is not something you need custom configuration for.
>>
>> Do the files need to be on a distributed FS or a local one?
>>
>> On Sun, Oct 20, 2013 at 8:54 PM, John Lilley <jo...@redpoint.net> wrote:
>>> We have a pure YARN application (no MapReduce) that has need to 
>>> store a significant amount of temporary data.  How can we know the 
>>> best location for these files?  How can we ensure that our YARN 
>>> tasks have write access to these locations?  Is this something that must be configured outside of YARN?
>>> Thanks,
>>> John
>>
>> --
>> Harsh J
>
>
>
> --
> Harsh J



--
Harsh J

RE: temporary file locations for YARN applications

Posted by John Lilley <jo...@redpoint.net>.
Thanks, sounds like LOCAL_DIR_ENV is the way to go.
john
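
The "round robin internally over those paths (with free space handling)" idea from the quoted reply could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name RoundRobinTempDirs and its methods are hypothetical, the free-space check relies on java.io.File.getUsableSpace() (which is a hint, not a guarantee), and it is not how LocalDirAllocator itself is implemented.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper (not part of Hadoop): rotates temp files across local dirs. */
final class RoundRobinTempDirs {
    private final List<File> dirs;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinTempDirs(List<File> dirs) {
        if (dirs.isEmpty()) {
            throw new IllegalArgumentException("at least one directory is required");
        }
        this.dirs = new ArrayList<>(dirs);
    }

    /** Returns the next directory in rotation that reports at least minFreeBytes usable. */
    File pick(long minFreeBytes) throws IOException {
        // floorMod keeps the index non-negative even if the counter overflows.
        int start = Math.floorMod(next.getAndIncrement(), dirs.size());
        for (int i = 0; i < dirs.size(); i++) {
            File candidate = dirs.get((start + i) % dirs.size());
            if (candidate.isDirectory() && candidate.getUsableSpace() >= minFreeBytes) {
                return candidate;
            }
        }
        throw new IOException("no local dir has " + minFreeBytes + " bytes free");
    }

    /** Creates an empty temp file in the chosen directory (prefix needs 3+ characters). */
    File createTempFile(String prefix, long minFreeBytes) throws IOException {
        return File.createTempFile(prefix, ".tmp", pick(minFreeBytes));
    }
}
```

The directory list would come from the container's local-dirs environment variable. Skipping directories that are short on space means a full disk degrades throughput instead of failing the task outright.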

-----Original Message-----
From: Harsh J [mailto:harsh@cloudera.com] 
Sent: Monday, October 21, 2013 12:11 PM
To: <us...@hadoop.apache.org>
Subject: Re: temporary file locations for YARN applications

The dirs in that env-var are app-specific and are for the app's user to utilize. You shouldn't have any permission issues working within them.

The LocalDirAllocator is still somewhat MR-bound, but you should still be able to make it work by giving it a config with the values it needs.

On Mon, Oct 21, 2013 at 8:49 PM, John Lilley <jo...@redpoint.net> wrote:
> Thanks again.  This gives me a lot of options; we will see what works.
>
> Do you know if there are any permissions issues if we directly access the folders of LOCAL_DIR_ENV?
>
> Regarding LocalDirAllocator, I see its constructor: LocalDirAllocator(String contextCfgItemName) and a note mentioning that an example of this item is "mapred.local.dir".  Is that the correct usage, or is there something YARN-generic?
>
> Cheers,
> john
>
> -----Original Message-----
> From: Harsh J [mailto:harsh@cloudera.com]
> Sent: Sunday, October 20, 2013 11:58 PM
> To: <us...@hadoop.apache.org>
> Subject: Re: temporary file locations for YARN applications
>
> Hi,
>
> MR does use multiple disks when spilling. But the work directory is also round-robined to spread I/O.
>
> YARN sets an environment property that's a list (comma-separated value)
> of directories (ApplicationConstants.LOCAL_DIR_ENV) your app container
> can together use. Perhaps read it in with
> StringUtils.getTrimmedStrings(System.getenv(ApplicationConstants.LOCAL_DIR_ENV));
> and then round-robin internally over those paths (with free space handling)?
>
> Perhaps you can even reuse the org.apache.hadoop.fs.LocalDirAllocator
> class, which is what MR uses. It's not been declared publicly stable though, but we can do that over a JIRA.
>
> On Mon, Oct 21, 2013 at 2:05 AM, John Lilley <jo...@redpoint.net> wrote:
>> Harsh, thanks for the quick response.  These files don't need to be on the DFS (although we use that too).  These are local files used during sorting, joining, transitive closure.
>>
>> The task-relative folder might be good enough, but our app *can* make use of multiple temp folders if they are available.  Our YARN app can be fairly I/O intensive; is it possible to allocate more than one temp folder on different physical devices?
>>
>> Or perhaps YARN might help us. Will YARN assign tasks to CWD folders on different disks so that they do not compete with each other on I/O?
>>
>> For that matter, where does MR allocate the temporary files generated by Mapper output?  Presumably MR has the same I/O parallelism requirements that we do.
>>
>> Thanks
>> John
>>
>>
>> -----Original Message-----
>> From: Harsh J [mailto:harsh@cloudera.com]
>> Sent: Sunday, October 20, 2013 10:49 AM
>> To: <us...@hadoop.apache.org>
>> Subject: Re: temporary file locations for YARN applications
>>
>> Every container gets its own local work directory (you can use the relative ./) that's auto-cleaned up at the end of the container's life.
>> This is the best place to store the temporary files. This is not something you need custom configuration for.
>>
>> Do the files need to be on a distributed FS or a local one?
>>
>> On Sun, Oct 20, 2013 at 8:54 PM, John Lilley <jo...@redpoint.net> wrote:
>>> We have a pure YARN application (no MapReduce) that has need to 
>>> store a significant amount of temporary data.  How can we know the 
>>> best location for these files?  How can we ensure that our YARN 
>>> tasks have write access to these locations?  Is this something that must be configured outside of YARN?
>>> Thanks,
>>> John
>>
>> --
>> Harsh J
>
>
>
> --
> Harsh J



--
Harsh J

RE: temporary file locations for YARN applications

Posted by John Lilley <jo...@redpoint.net>.
By the way, it seems that this ended up being a hard-coded environment variable name "LOCAL_DIRS" instead of ApplicationConstants.LOCAL_DIR_ENV, which we can't see defined anywhere.
John


-----Original Message-----
From: Harsh J [mailto:harsh@cloudera.com] 
Sent: Monday, October 21, 2013 12:11 PM
To: <us...@hadoop.apache.org>
Subject: Re: temporary file locations for YARN applications

The dirs in that env-var are app-specific and are for the app's user to utilize. You shouldn't have any permission issues working within them.

The LocalDirAllocator is still somewhat MR-bound but you can still be able to make it work by giving it a config with the values it needs.

On Mon, Oct 21, 2013 at 8:49 PM, John Lilley <jo...@redpoint.net> wrote:
> Thanks again.  This gives me a lot of options; we will see what works.
>
> Do you know if there are any permissions issues if we directly access the folders of LOCAL_DIR_ENV?
>
> Regarding LocalDirAllocator, I see its constructor: LocalDirAllocator(String contextCfgItemName) and a note mentioning that an example of this item is "mapred.local.dir".  Is that the correct usage, or is there something YARN-generic?
>
> Cheers,
> john
>
> -----Original Message-----
> From: Harsh J [mailto:harsh@cloudera.com]
> Sent: Sunday, October 20, 2013 11:58 PM
> To: <us...@hadoop.apache.org>
> Subject: Re: temporary file locations for YARN applications
>
> Hi,
>
> MR does use multiple disks when spilling. But the work directory is also round-robined to spread I/O.
>
> YARN sets an environment property thats a list (comma separated value) 
> of directories (ApplicationConstants.LOCAL_DIR_ENV) your app container 
> can together use. Perhaps read it in with 
> StringUtils.getTrimmedStrings(System.getenv(ApplicationConstants.LOCAL
> _DIR_ENV)); and then round robin internally over those paths (with 
> free space handling)?
>
> Perhaps you can even reuse the org.apache.hadoop.fs.LocalDirAllocator
> class; which is what MR uses. Its not been declared publicly stable though, but we can do that over a JIRA.
>
> On Mon, Oct 21, 2013 at 2:05 AM, John Lilley <jo...@redpoint.net> wrote:
>> Harsh, thanks for the quick response.  These files don't need to be on the DFS (although we use that too).  These are local files used during sorting, joining, transitive closure.
>>
>> The task-relative folder might be good enough, but our app *can* make use of multiple temp folders if they are available.  Our YARN app can be fairly I/O intensive; is it possible to allocate more than one temp folder on different physical devices?
>>
>> Or perhaps YARN might help us. Will YARN assign tasks to CWD folders on different disks so that they do not compete with each other on I/O?
>>
>> For that matter, where does MR allocate the temporary files generated by Mapper output?  Presumably MR has the same I/O parallelism requirements that we do.
>>
>> Thanks
>> John
>>
>>
>> -----Original Message-----
>> From: Harsh J [mailto:harsh@cloudera.com]
>> Sent: Sunday, October 20, 2013 10:49 AM
>> To: <us...@hadoop.apache.org>
>> Subject: Re: temporary file locations for YARN applications
>>
>> Every container gets its own local work directory (you can use the relative ./) that's auto-cleaned up at the end of the container's life.
>> This is the best place to store the temporary files. This is not something you need custom configuration for.
>>
>> Do the files need to be on a distributed FS or a local one?
>>
>> On Sun, Oct 20, 2013 at 8:54 PM, John Lilley <jo...@redpoint.net> wrote:
>>> We have a pure YARN application (no MapReduce) that has need to 
>>> store a significant amount of temporary data.  How can we know the 
>>> best location for these files?  How can we ensure that our YARN 
>>> tasks have write access to these locations?  Is this something that must be configured outside of YARN?
>>> Thanks,
>>> John
>>
>> --
>> Harsh J
>
>
>
> --
> Harsh J



--
Harsh J

RE: temporary file locations for YARN applications

Posted by John Lilley <jo...@redpoint.net>.
Thanks, sounds like LOCAL_DIR_ENV is the way to go.
john

-----Original Message-----
From: Harsh J [mailto:harsh@cloudera.com] 
Sent: Monday, October 21, 2013 12:11 PM
To: <us...@hadoop.apache.org>
Subject: Re: temporary file locations for YARN applications

The dirs in that env-var are app-specific and are for the app's user to utilize. You shouldn't have any permission issues working within them.

The LocalDirAllocator is still somewhat MR-bound, but you should still be able to make it work by giving it a config with the values it needs.

On Mon, Oct 21, 2013 at 8:49 PM, John Lilley <jo...@redpoint.net> wrote:
> Thanks again.  This gives me a lot of options; we will see what works.
>
> Do you know if there are any permissions issues if we directly access the folders of LOCAL_DIR_ENV?
>
> Regarding LocalDirAllocator, I see its constructor: LocalDirAllocator(String contextCfgItemName) and a note mentioning that an example of this item is "mapred.local.dir".  Is that the correct usage, or is there something YARN-generic?
>
> Cheers,
> john
>
> -----Original Message-----
> From: Harsh J [mailto:harsh@cloudera.com]
> Sent: Sunday, October 20, 2013 11:58 PM
> To: <us...@hadoop.apache.org>
> Subject: Re: temporary file locations for YARN applications
>
> Hi,
>
> MR does use multiple disks when spilling. But the work directory is also round-robined to spread I/O.
>
> YARN sets an environment property that's a comma-separated list of
> directories (ApplicationConstants.LOCAL_DIR_ENV) your app container
> can together use. Perhaps read it in with
> StringUtils.getTrimmedStrings(System.getenv(ApplicationConstants.LOCAL_DIR_ENV));
> and then round robin internally over those paths (with free space
> handling)?
>
> Perhaps you can even reuse the org.apache.hadoop.fs.LocalDirAllocator
> class, which is what MR uses. It hasn't been declared publicly stable, though we can do that over a JIRA.
>
> On Mon, Oct 21, 2013 at 2:05 AM, John Lilley <jo...@redpoint.net> wrote:
>> Harsh, thanks for the quick response.  These files don't need to be on the DFS (although we use that too).  These are local files used during sorting, joining, transitive closure.
>>
>> The task-relative folder might be good enough, but our app *can* make use of multiple temp folders if they are available.  Our YARN app can be fairly I/O intensive; is it possible to allocate more than one temp folder on different physical devices?
>>
>> Or perhaps YARN might help us. Will YARN assign tasks to CWD folders on different disks so that they do not compete with each other on I/O?
>>
>> For that matter, where does MR allocate the temporary files generated by Mapper output?  Presumably MR has the same I/O parallelism requirements that we do.
>>
>> Thanks
>> John
>>
>>
>> -----Original Message-----
>> From: Harsh J [mailto:harsh@cloudera.com]
>> Sent: Sunday, October 20, 2013 10:49 AM
>> To: <us...@hadoop.apache.org>
>> Subject: Re: temporary file locations for YARN applications
>>
>> Every container gets its own local work directory (you can use the relative ./) that's auto-cleaned up at the end of the container's life.
>> This is the best place to store the temporary files. This is not something you need custom configuration for.
>>
>> Do the files need to be on a distributed FS or a local one?
>>
>> On Sun, Oct 20, 2013 at 8:54 PM, John Lilley <jo...@redpoint.net> wrote:
>>> We have a pure YARN application (no MapReduce) that has need to 
>>> store a significant amount of temporary data.  How can we know the 
>>> best location for these files?  How can we ensure that our YARN 
>>> tasks have write access to these locations?  Is this something that must be configured outside of YARN?
>>> Thanks,
>>> John
>>
>> --
>> Harsh J
>
>
>
> --
> Harsh J



--
Harsh J

Re: temporary file locations for YARN applications

Posted by Harsh J <ha...@cloudera.com>.
The dirs in that env-var are app-specific and are for the app's user
to utilize. You shouldn't have any permission issues working within
them.

The LocalDirAllocator is still somewhat MR-bound, but you should still be
able to make it work by giving it a config in which the key you pass to
its constructor is set to the list of local directories.

On Mon, Oct 21, 2013 at 8:49 PM, John Lilley <jo...@redpoint.net> wrote:
> Thanks again.  This gives me a lot of options; we will see what works.
>
> Do you know if there are any permissions issues if we directly access the folders of LOCAL_DIR_ENV?
>
> Regarding LocalDirAllocator, I see its constructor: LocalDirAllocator(String contextCfgItemName) and a note mentioning that an example of this item is "mapred.local.dir".  Is that the correct usage, or is there something YARN-generic?
>
> Cheers,
> john
>
> -----Original Message-----
> From: Harsh J [mailto:harsh@cloudera.com]
> Sent: Sunday, October 20, 2013 11:58 PM
> To: <us...@hadoop.apache.org>
> Subject: Re: temporary file locations for YARN applications
>
> Hi,
>
> MR does use multiple disks when spilling. But the work directory is also round-robined to spread I/O.
>
> YARN sets an environment property that's a comma-separated list of directories (ApplicationConstants.LOCAL_DIR_ENV) your app container can together use. Perhaps read it in with StringUtils.getTrimmedStrings(System.getenv(ApplicationConstants.LOCAL_DIR_ENV));
> and then round robin internally over those paths (with free space handling)?
>
> Perhaps you can even reuse the org.apache.hadoop.fs.LocalDirAllocator
> class, which is what MR uses. It hasn't been declared publicly stable, though we can do that over a JIRA.
>
> On Mon, Oct 21, 2013 at 2:05 AM, John Lilley <jo...@redpoint.net> wrote:
>> Harsh, thanks for the quick response.  These files don't need to be on the DFS (although we use that too).  These are local files used during sorting, joining, transitive closure.
>>
>> The task-relative folder might be good enough, but our app *can* make use of multiple temp folders if they are available.  Our YARN app can be fairly I/O intensive; is it possible to allocate more than one temp folder on different physical devices?
>>
>> Or perhaps YARN might help us. Will YARN assign tasks to CWD folders on different disks so that they do not compete with each other on I/O?
>>
>> For that matter, where does MR allocate the temporary files generated by Mapper output?  Presumably MR has the same I/O parallelism requirements that we do.
>>
>> Thanks
>> John
>>
>>
>> -----Original Message-----
>> From: Harsh J [mailto:harsh@cloudera.com]
>> Sent: Sunday, October 20, 2013 10:49 AM
>> To: <us...@hadoop.apache.org>
>> Subject: Re: temporary file locations for YARN applications
>>
>> Every container gets its own local work directory (you can use the relative ./) that's auto-cleaned up at the end of the container's life.
>> This is the best place to store the temporary files. This is not something you need custom configuration for.
>>
>> Do the files need to be on a distributed FS or a local one?
>>
>> On Sun, Oct 20, 2013 at 8:54 PM, John Lilley <jo...@redpoint.net> wrote:
>>> We have a pure YARN application (no MapReduce) that has need to store
>>> a significant amount of temporary data.  How can we know the best
>>> location for these files?  How can we ensure that our YARN tasks have
>>> write access to these locations?  Is this something that must be configured outside of YARN?
>>> Thanks,
>>> John
>>
>> --
>> Harsh J
>
>
>
> --
> Harsh J



-- 
Harsh J

RE: temporary file locations for YARN applications

Posted by John Lilley <jo...@redpoint.net>.
Thanks again.  This gives me a lot of options; we will see what works.

Do you know if there are any permissions issues if we directly access the folders of LOCAL_DIR_ENV?

Regarding LocalDirAllocator, I see its constructor: LocalDirAllocator(String contextCfgItemName) and a note mentioning that an example of this item is "mapred.local.dir".  Is that the correct usage, or is there something YARN-generic?

Cheers,
john

-----Original Message-----
From: Harsh J [mailto:harsh@cloudera.com] 
Sent: Sunday, October 20, 2013 11:58 PM
To: <us...@hadoop.apache.org>
Subject: Re: temporary file locations for YARN applications

Hi,

MR does use multiple disks when spilling. But the work directory is also round-robined to spread I/O.

YARN sets an environment property that's a comma-separated list of directories (ApplicationConstants.LOCAL_DIR_ENV) your app container can together use. Perhaps read it in with StringUtils.getTrimmedStrings(System.getenv(ApplicationConstants.LOCAL_DIR_ENV));
and then round robin internally over those paths (with free space handling)?

Perhaps you can even reuse the org.apache.hadoop.fs.LocalDirAllocator
class, which is what MR uses. It hasn't been declared publicly stable, though we can do that over a JIRA.

On Mon, Oct 21, 2013 at 2:05 AM, John Lilley <jo...@redpoint.net> wrote:
> Harsh, thanks for the quick response.  These files don't need to be on the DFS (although we use that too).  These are local files used during sorting, joining, transitive closure.
>
> The task-relative folder might be good enough, but our app *can* make use of multiple temp folders if they are available.  Our YARN app can be fairly I/O intensive; is it possible to allocate more than one temp folder on different physical devices?
>
> Or perhaps YARN might help us. Will YARN assign tasks to CWD folders on different disks so that they do not compete with each other on I/O?
>
> For that matter, where does MR allocate the temporary files generated by Mapper output?  Presumably MR has the same I/O parallelism requirements that we do.
>
> Thanks
> John
>
>
> -----Original Message-----
> From: Harsh J [mailto:harsh@cloudera.com]
> Sent: Sunday, October 20, 2013 10:49 AM
> To: <us...@hadoop.apache.org>
> Subject: Re: temporary file locations for YARN applications
>
> Every container gets its own local work directory (you can use the relative ./) that's auto-cleaned up at the end of the container's life.
> This is the best place to store the temporary files. This is not something you need custom configuration for.
>
> Do the files need to be on a distributed FS or a local one?
>
> On Sun, Oct 20, 2013 at 8:54 PM, John Lilley <jo...@redpoint.net> wrote:
>> We have a pure YARN application (no MapReduce) that has need to store 
>> a significant amount of temporary data.  How can we know the best 
>> location for these files?  How can we ensure that our YARN tasks have 
>> write access to these locations?  Is this something that must be configured outside of YARN?
>> Thanks,
>> John
>
> --
> Harsh J



--
Harsh J

Re: temporary file locations for YARN applications

Posted by Harsh J <ha...@cloudera.com>.
Hi,

MR does use multiple disks when spilling. The container work directory
is also round-robined across disks to spread I/O.

YARN sets an environment variable (ApplicationConstants.LOCAL_DIR_ENV)
that holds a comma-separated list of directories your app container can
use together. Perhaps read it in with
StringUtils.getTrimmedStrings(System.getenv(ApplicationConstants.LOCAL_DIR_ENV))
and then round-robin internally over those paths (with free-space
handling)?
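
A rough sketch of that round-robin idea in plain Java, with no Hadoop
classes involved. The class name LocalDirPicker and the minFreeBytes
parameter are made up for illustration. The caller passes in the raw
comma-separated value; elsewhere in this thread it is reported that the
variable ended up being named "LOCAL_DIRS", so treat the exact name as
an assumption that depends on your Hadoop version.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: rotates over the container's local dirs and skips
// any dir that is missing, read-only, or short on free space.
final class LocalDirPicker {
    private final List<File> dirs = new ArrayList<>();
    private final AtomicInteger next = new AtomicInteger();

    // commaSeparatedDirs is the raw env value, e.g. System.getenv("LOCAL_DIRS")
    // (variable name is an assumption; check your Hadoop version).
    LocalDirPicker(String commaSeparatedDirs) {
        if (commaSeparatedDirs != null) {
            for (String d : commaSeparatedDirs.split(",")) {
                String trimmed = d.trim();
                if (!trimmed.isEmpty()) {
                    dirs.add(new File(trimmed));
                }
            }
        }
        if (dirs.isEmpty()) {
            // Fall back to the container's own work directory.
            dirs.add(new File("."));
        }
    }

    // Returns the next usable dir in rotation, trying each dir at most once.
    File pick(long minFreeBytes) throws IOException {
        for (int i = 0; i < dirs.size(); i++) {
            int idx = Math.floorMod(next.getAndIncrement(), dirs.size());
            File dir = dirs.get(idx);
            if (dir.isDirectory() && dir.canWrite()
                    && dir.getUsableSpace() >= minFreeBytes) {
                return dir;
            }
        }
        throw new IOException(
            "No local dir with at least " + minFreeBytes + " bytes free");
    }
}
```

Each call to pick() advances the rotation, so successive temp files land
on successive directories; a full disk is simply skipped for that call.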

Perhaps you can even reuse the org.apache.hadoop.fs.LocalDirAllocator
class, which is what MR uses. It has not been declared publicly stable
yet, but we can do that over a JIRA.

On Mon, Oct 21, 2013 at 2:05 AM, John Lilley <jo...@redpoint.net> wrote:
> Harsh, thanks for the quick response.  These files don't need to be on the DFS (although we use that too).  These are local files used during sorting, joining, transitive closure.
>
> The task-relative folder might be good enough, but our app *can* make use of multiple temp folders if they are available.  Our YARN app can be fairly I/O intensive; is it possible to allocate more than one temp folder on different physical devices?
>
> Or perhaps YARN might help us. Will YARN assign tasks to CWD folders on different disks so that they do not compete with each other on I/O?
>
> For that matter, where does MR allocate the temporary files generated by Mapper output?  Presumably MR has the same I/O parallelism requirements that we do.
>
> Thanks
> John
>
>
> -----Original Message-----
> From: Harsh J [mailto:harsh@cloudera.com]
> Sent: Sunday, October 20, 2013 10:49 AM
> To: <us...@hadoop.apache.org>
> Subject: Re: temporary file locations for YARN applications
>
> Every container gets its own local work directory (you can use the relative path ./) that is automatically cleaned up at the end of the container's life.
> This is the best place to store the temporary files. This is not something you need custom configuration for.
>
> Do the files need to be on a distributed FS or a local one?
>
> On Sun, Oct 20, 2013 at 8:54 PM, John Lilley <jo...@redpoint.net> wrote:
>> We have a pure YARN application (no MapReduce) that has need to store
>> a significant amount of temporary data.  How can we know the best
>> location for these files?  How can we ensure that our YARN tasks have
>> write access to these locations?  Is this something that must be configured outside of YARN?
>> Thanks,
>> John
>
> --
> Harsh J



-- 
Harsh J

RE: temporary file locations for YARN applications

Posted by John Lilley <jo...@redpoint.net>.
Harsh, thanks for the quick response.  These files don't need to be on the DFS (although we use that too).  These are local files used during sorting, joining, and transitive closure.

The task-relative folder might be good enough, but our app *can* make use of multiple temp folders if they are available.  Our YARN app can be fairly I/O intensive; is it possible to allocate more than one temp folder on different physical devices?  

Or perhaps YARN might help us. Will YARN assign tasks to CWD folders on different disks so that they do not compete with each other on I/O?  

For that matter, where does MR allocate the temporary files generated by Mapper output?  Presumably MR has the same I/O parallelism requirements that we do.

Thanks
John


-----Original Message-----
From: Harsh J [mailto:harsh@cloudera.com] 
Sent: Sunday, October 20, 2013 10:49 AM
To: <us...@hadoop.apache.org>
Subject: Re: temporary file locations for YARN applications

Every container gets its own local work directory (you can use the relative path ./) that is automatically cleaned up at the end of the container's life.
This is the best place to store the temporary files. This is not something you need custom configuration for.

Do the files need to be on a distributed FS or a local one?

On Sun, Oct 20, 2013 at 8:54 PM, John Lilley <jo...@redpoint.net> wrote:
> We have a pure YARN application (no MapReduce) that has need to store 
> a significant amount of temporary data.  How can we know the best 
> location for these files?  How can we ensure that our YARN tasks have 
> write access to these locations?  Is this something that must be configured outside of YARN?
> Thanks,
> John

--
Harsh J

Re: temporary file locations for YARN applications

Posted by Harsh J <ha...@cloudera.com>.
Every container gets its own local work directory (you can use the
relative path ./) that is automatically cleaned up at the end of the
container's life. This is the best place to store the temporary files.
This is not something you need custom configuration for.
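
As a minimal sketch of using the work directory, the snippet below
creates a uniquely named temp file in the process's current directory,
which for a YARN container is the container work directory. The class
name ContainerTemp and the prefix argument are hypothetical, and only
standard java.nio calls are used.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: temp files in the current working directory.
final class ContainerTemp {
    // Creates an empty, uniquely named file such as ./sort-123456.tmp
    static Path newTempFile(String prefix) throws IOException {
        // An empty relative path resolves to the current working directory.
        Path workDir = Paths.get("").toAbsolutePath();
        return Files.createTempFile(workDir, prefix, ".tmp");
    }
}
```

Since the work directory is cleaned up when the container ends, the app
only needs to delete such files early if it wants the space back sooner.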

Do the files need to be on a distributed FS or a local one?

On Sun, Oct 20, 2013 at 8:54 PM, John Lilley <jo...@redpoint.net> wrote:
> We have a pure YARN application (no MapReduce) that has need to store a
> significant amount of temporary data.  How can we know the best location for
> these files?  How can we ensure that our YARN tasks have write access to
> these locations?  Is this something that must be configured outside of YARN?
>
> Thanks,
>
> John
>
>
>
>



-- 
Harsh J
