Posted to common-user@hadoop.apache.org by Nan Zhu <zh...@gmail.com> on 2011/01/13 09:27:28 UTC

Why Hadoop uses HTTP for file transmission between Map and Reduce?

Hi, all

I have a question about the file transmission between the Map and Reduce
stages: in the current implementation, the Reducers fetch the results
generated by the Mappers through HTTP GET. I don't understand why HTTP was
selected. Why not FTP, or a self-developed protocol?

Just because HTTP is simple?

thanks

Nan
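The pull Nan is asking about can be made concrete with a toy sketch: a "map side" HTTP server hands out a partition's bytes, and a "reduce side" client pulls them with a plain HTTP GET. This is a minimal illustration of the pattern, not Hadoop's actual code; the `/mapOutput` path and `partition` parameter are made up for the example.

```python
# Toy shuffle-style fetch: map output served over HTTP, pulled via GET.
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

MAP_OUTPUT = {0: b"apple\t3\nbanana\t7\n"}  # partition -> serialized bytes

class MapOutputHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # e.g. GET /mapOutput?partition=0 (hypothetical path, not Hadoop's)
        partition = int(self.path.rsplit("=", 1)[-1])
        body = MAP_OUTPUT[partition]
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), MapOutputHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/mapOutput?partition=0"
with urllib.request.urlopen(url) as resp:
    data = resp.read()  # the "reducer" now has the map-output bytes
server.shutdown()
print(data == MAP_OUTPUT[0])  # -> True
```

The point of the pattern: any HTTP client can fetch the segment, debugging is easy (curl works), and the server side is stock library code rather than a bespoke wire protocol.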

Re: Why Hadoop uses HTTP for file transmission between Map and Reduce?

Posted by Patrick Julien <pj...@gmail.com>.
Since you're an Apache project, is there a reason you would favor
Netty over Apache's own MINA?

On Sun, Jan 16, 2011 at 2:21 AM, Otis Gospodnetic
<ot...@yahoo.com> wrote:
> Hi,
>
>
> ----- Original Message ----
>> From: Owen O'Malley <om...@apache.org>
>>
>> At some point we'll replace Jetty in the shuffle, because it imposes too
>> much overhead, and go to Netty or some other lower-level library. I don't
>> think that using HTTP adds that much overhead, although it would be
>> interesting to measure that.
>
> I use Jetty in a lot of places, so I'm curious about this Jetty + overhead
> comment.  Could you please share more info about what sort of overhead you are
> referring to? (also, does it apply to all recent versions of Jetty - 6, 7, and
> 8?)
>
> Thanks,
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Hadoop ecosystem search :: http://search-hadoop.com/
>
>

Re: Why Hadoop uses HTTP for file transmission between Map and Reduce?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi,


----- Original Message ----
> From: Owen O'Malley <om...@apache.org>
>
> At some point we'll replace Jetty in the shuffle, because it imposes too
> much overhead, and go to Netty or some other lower-level library. I don't
> think that using HTTP adds that much overhead, although it would be
> interesting to measure that.

I use Jetty in a lot of places, so I'm curious about this Jetty + overhead 
comment.  Could you please share more info about what sort of overhead you are 
referring to? (also, does it apply to all recent versions of Jetty - 6, 7, and 
8?)

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/


Re: Why Hadoop uses HTTP for file transmission between Map and Reduce?

Posted by Owen O'Malley <om...@apache.org>.
At some point we'll replace Jetty in the shuffle, because it imposes too
much overhead, and go to Netty or some other lower-level library. I don't
think that using HTTP adds that much overhead, although it would be
interesting to measure that.

-- Owen
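Owen's claim that HTTP itself adds little overhead can be sanity-checked with back-of-the-envelope arithmetic: compare per-request header bytes against a typical map-output segment. The header size and segment size below are illustrative assumptions, not measured Hadoop values.

```python
# Rough arithmetic behind "HTTP adds little overhead for bulk transfer".
header_bytes = 300                 # assumed request + response header size
segment_bytes = 64 * 1024 * 1024   # one assumed 64 MB map-output segment

overhead = header_bytes / (header_bytes + segment_bytes)
print(f"{overhead:.8%}")  # a tiny fraction of a percent
```

For many small segments the picture changes, which is why keep-alive connections and batching matter more than the protocol's framing itself.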

Re: Why Hadoop uses HTTP for file transmission between Map and Reduce?

Posted by He Chen <ai...@gmail.com>.
Actually, PhEDEx uses GridFTP for its data transfers.

On Thu, Jan 13, 2011 at 5:34 AM, Steve Loughran <st...@apache.org> wrote:

> On 13/01/11 08:34, li ping wrote:
>
>> That is also my concern. Is it efficient for data transmission?
>>
>
> It's long-lived TCP connections, reasonably efficient for bulk data xfer,
> has all the throttling of TCP built in, and comes with some excellently
> debugged client and server code in the form of Jetty and HttpClient. In
> maintenance costs alone, those libraries justify HTTP unless you have a
> vastly superior option *and are willing to maintain it forever*.
>
> FTP's limits are well known (security); NFS's limits are well known
> (security, and the UDP version doesn't throttle); self-developed protocols
> will have whatever problems you want.
>
> There are better protocols for long-haul data transfer over fat pipes, such
> as GridFTP and PhEDEx ( http://www.gridpp.ac.uk/papers/ah05_phedex.pdf ),
> which use multiple TCP channels in parallel to reduce the impact of a
> single lost packet, but within a datacentre you shouldn't have to worry
> about this. If you do find lots of packets getting lost, raise the issue
> with the networking team.
>
> -Steve
>
>
>
>> On Thu, Jan 13, 2011 at 4:27 PM, Nan Zhu<zh...@gmail.com>  wrote:
>>
>>  Hi, all
>>>
>>> I have a question about the file transmission between the Map and Reduce
>>> stages: in the current implementation, the Reducers fetch the results
>>> generated by the Mappers through HTTP GET. I don't understand why HTTP
>>> was selected. Why not FTP, or a self-developed protocol?
>>>
>>> Just because HTTP is simple?
>>>
>>> thanks
>>>
>>> Nan
>>>
>>>
>>
>>
>>
>

Re: Why Hadoop uses HTTP for file transmission between Map and Reduce?

Posted by Steve Loughran <st...@apache.org>.
On 13/01/11 08:34, li ping wrote:
> That is also my concern. Is it efficient for data transmission?

It's long-lived TCP connections, reasonably efficient for bulk data
xfer, has all the throttling of TCP built in, and comes with some
excellently debugged client and server code in the form of Jetty and
HttpClient. In maintenance costs alone, those libraries justify HTTP
unless you have a vastly superior option *and are willing to maintain it
forever*.

FTP's limits are well known (security); NFS's limits are well known
(security, and the UDP version doesn't throttle); self-developed
protocols will have whatever problems you want.

There are better protocols for long-haul data transfer over fat pipes,
such as GridFTP and PhEDEx (
http://www.gridpp.ac.uk/papers/ah05_phedex.pdf ), which use multiple TCP
channels in parallel to reduce the impact of a single lost packet, but
within a datacentre you shouldn't have to worry about this. If you do
find lots of packets getting lost, raise the issue with the networking
team.

-Steve

>
> On Thu, Jan 13, 2011 at 4:27 PM, Nan Zhu<zh...@gmail.com>  wrote:
>
>> Hi, all
>>
>> I have a question about the file transmission between the Map and Reduce
>> stages: in the current implementation, the Reducers fetch the results
>> generated by the Mappers through HTTP GET. I don't understand why HTTP was
>> selected. Why not FTP, or a self-developed protocol?
>>
>> Just because HTTP is simple?
>>
>> thanks
>>
>> Nan
>>
>
>
>
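The multi-channel idea Steve attributes to GridFTP and PhEDEx can be sketched with stock HTTP: pull one large object over several parallel connections (here, HTTP Range requests against a toy server), so a stall on one stream delays only its slice. This illustrates the technique only; it is not either tool's actual protocol.

```python
# Parallel-channel transfer sketch: split one blob into ranges and fetch
# the ranges concurrently over separate HTTP connections.
import threading
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

PAYLOAD = bytes(range(256)) * 1024  # 256 KB of known data

class RangeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Expect "Range: bytes=start-end" and serve just that slice.
        start, end = map(int, self.headers["Range"].split("=")[1].split("-"))
        body = PAYLOAD[start:end + 1]
        self.send_response(206)  # Partial Content
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = ThreadingHTTPServer(("127.0.0.1", 0), RangeHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/blob"

def fetch(span):
    start, end = span
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

chunk = len(PAYLOAD) // 4
spans = [(i * chunk, (i + 1) * chunk - 1) for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    result = b"".join(pool.map(fetch, spans))  # reassemble in order
server.shutdown()
print(result == PAYLOAD)  # -> True
```

As Steve notes, within one datacentre the single-stream case is usually fine; parallel channels mainly pay off on long fat pipes where a lost packet stalls a lone TCP window.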


Re: Why Hadoop uses HTTP for file transmission between Map and Reduce?

Posted by li ping <li...@gmail.com>.
That is also my concern. Is it efficient for data transmission?

On Thu, Jan 13, 2011 at 4:27 PM, Nan Zhu <zh...@gmail.com> wrote:

> Hi, all
>
> I have a question about the file transmission between the Map and Reduce
> stages: in the current implementation, the Reducers fetch the results
> generated by the Mappers through HTTP GET. I don't understand why HTTP was
> selected. Why not FTP, or a self-developed protocol?
>
> Just because HTTP is simple?
>
> thanks
>
> Nan
>



-- 
-----李平