Posted to common-user@hadoop.apache.org by John Lilley <jo...@redpoint.net> on 2013/05/23 23:44:12 UTC

HTTP file server, map output, and other files

Thanks to previous kind answers and more reading in the elephant book, I now understand that mapper tasks place partitioned results into local files that are served up to reducers via HTTP:

"The output file's partitions are made available to the reducers over HTTP. The maximum number of worker threads used to serve the file partitions is controlled by the tasktracker.http.threads property; this setting is per tasktracker, not per map task slot. The default of 40 may need to be increased for large clusters running large jobs. In MapReduce 2, this property is not applicable because the maximum number of threads used is set automatically based on the number of processors on the machine. (MapReduce 2 uses Netty, which by default allows up to twice as many threads as there are processors.)"

My question is, for a custom (non-MR) application under YARN, how would I set up my application tasks' output data to be served over HTTP?  Is there an API to control this, or are there predefined local folders that will be served up?  Once I am finished with the temporary data, how do I request that the files are removed?

Thanks
John


Re: HTTP file server, map output, and other files

Posted by Harsh J <ha...@cloudera.com>.
YARN has a ShuffleHandler plugin used for MR purposes, but the APIs
used here aren't "general"/public so you'd have to build your own
utilities to do that. It's not too difficult to achieve but a general
API would certainly be nice.
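
As a rough illustration of the "build your own utilities" route, here is a
minimal sketch of a per-task file server plus cleanup helper. It is not the
ShuffleHandler and uses no YARN API: it relies only on the JDK's built-in
com.sun.net.httpserver package. The class name TaskOutputServer, its method
names, and the idea of passing port 0 to get a free port are all assumptions
made for this example.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: serves a task's local output directory over HTTP. */
public class TaskOutputServer {

    /** Starts a server for regular files under root; port 0 picks a free port. */
    public static HttpServer start(Path root, int port, int threads) throws IOException {
        Path base = root.toRealPath();
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/", exchange -> handle(base, exchange));
        // Daemon workers, so only the server itself keeps the JVM alive until stop().
        server.setExecutor(Executors.newFixedThreadPool(threads, r -> {
            Thread t = new Thread(r, "task-output-http");
            t.setDaemon(true);
            return t;
        }));
        server.start();
        return server;
    }

    private static void handle(Path base, HttpExchange exchange) throws IOException {
        try {
            String rel = exchange.getRequestURI().getPath().replaceFirst("^/+", "");
            Path file = base.resolve(rel).normalize();
            // Refuse non-GET requests, paths escaping the directory, and non-files.
            if (!"GET".equals(exchange.getRequestMethod())
                    || !file.startsWith(base) || !Files.isRegularFile(file)) {
                exchange.sendResponseHeaders(404, -1);
                return;
            }
            exchange.sendResponseHeaders(200, Files.size(file));
            try (OutputStream out = exchange.getResponseBody()) {
                Files.copy(file, out);
            }
        } finally {
            exchange.close();
        }
    }

    /** Deletes the directory tree once the temporary data is no longer needed. */
    public static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order lists children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }
}
```

A task could call TaskOutputServer.start(outputDir, 0, 4), read the chosen
port from server.getAddress().getPort(), and report host and port to whatever
coordinates the consumers (how that is reported is left open here). When the
consumers are done, server.stop(0) followed by
TaskOutputServer.deleteRecursively(outputDir) removes the temporary data.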

Tez (Incubating) aims to solve some of this for users writing YARN
apps in a general way, but it isn't consumable yet. You can follow Tez
on the Apache Incubator at
http://incubator.apache.org/projects/tez.html.

P.S. As mentioned, YARN-based MR2 no longer uses Jetty to serve map
output. It uses Netty, though the transfer itself is still HTTP.

On Fri, May 24, 2013 at 3:14 AM, John Lilley <jo...@redpoint.net> wrote:
> Thanks to previous kind answers and more reading in the elephant book, I now
> understand that mapper tasks place partitioned results into local files that
> are served up to reducers via HTTP:
>
>
>
> “The output file’s partitions are made available to the reducers over HTTP.
> The maximum number of worker threads used to serve the file partitions is
> controlled by the tasktracker.http.threads property; this setting is per
> tasktracker, not per map task slot. The default of 40 may need to be
> increased for large clusters running large jobs. In MapReduce 2, this
> property is not applicable because the maximum number of threads used is set
> automatically based on the number of processors on the machine. (MapReduce 2
> uses Netty, which by default allows up to twice as many threads as there are
> processors.)”
>
>
>
> My question is, for a custom (non-MR) application under YARN, how would I
> set up my application tasks’ output data to be served over HTTP?  Is there
> an API to control this, or are there predefined local folders that will be
> served up?  Once I am finished with the temporary data, how do I request
> that the files are removed?
>
>
>
> Thanks
>
> John
>
>



--
Harsh J
