Posted to common-user@hadoop.apache.org by Nan Zhu <zh...@gmail.com> on 2011/01/13 09:27:28 UTC
Why Hadoop uses HTTP for file transmission between Map and Reduce?
Hi, all
I have a question about the file transmission between the Map and Reduce stages:
in the current implementation, the Reducers fetch the results generated by the Mappers
through HTTP GET. I don't understand why HTTP was selected. Why not FTP, or a
self-developed protocol?
Just because HTTP is simple?
thanks
Nan
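For readers unfamiliar with the mechanism under discussion: the shuffle fetch is, at its core, an HTTP GET against a URL that identifies one mapper's output partition. The sketch below is illustrative only; the servlet path and parameter names are assumptions made for this example, not Hadoop's actual interface.

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ShuffleFetchSketch {
    // Hypothetical helper: builds the kind of URL a reducer might use to ask
    // one tasktracker for a single map-output partition. The "/mapOutput"
    // path and the query parameter names are illustrative assumptions.
    static String mapOutputUrl(String host, int port, String jobId,
                               String mapId, int reducePartition) {
        return "http://" + host + ":" + port + "/mapOutput"
                + "?job=" + jobId
                + "&map=" + mapId
                + "&reduce=" + reducePartition;
    }

    // The fetch itself is nothing more than a plain HTTP GET; the returned
    // stream carries the sorted map output for the reducer's partition.
    static InputStream openMapOutput(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        return conn.getInputStream();
    }

    public static void main(String[] args) {
        // Print the URL a reducer would request; no network access needed here.
        System.out.println(mapOutputUrl("tt-host", 50060,
                "job_201101130001_0001",
                "attempt_201101130001_0001_m_000003_0", 7));
    }
}
```

The point the thread's answers make is that because this is just HTTP, both sides of that exchange can be handled by existing, well-tested libraries.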
Re: Why Hadoop uses HTTP for file transmission between Map and Reduce?
Posted by Patrick Julien <pj...@gmail.com>.
Since you're an Apache project, is there a reason you would favor
Netty over Apache's own MINA?
On Sun, Jan 16, 2011 at 2:21 AM, Otis Gospodnetic
<ot...@yahoo.com> wrote:
> Hi,
>
>
> ----- Original Message ----
>> From: Owen O'Malley <om...@apache.org>
>>
>> At some point, we'll replace Jetty in the shuffle, because it imposes too
>> much overhead and go to Netty or some other lower level library. I don't
>> think that using HTTP adds that much overhead although it would be
>> interesting to measure that.
>
> I use Jetty in a lot of places, so I'm curious about this Jetty + overhead
> comment. Could you please share more info about what sort of overhead you are
> referring to? (also, does it apply to all recent versions of Jetty - 6, 7, and
> 8?)
>
> Thanks,
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Hadoop ecosystem search :: http://search-hadoop.com/
>
>
Re: Why Hadoop uses HTTP for file transmission between Map and Reduce?
Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi,
----- Original Message ----
> From: Owen O'Malley <om...@apache.org>
>
> At some point, we'll replace Jetty in the shuffle, because it imposes too
> much overhead and go to Netty or some other lower level library. I don't
> think that using HTTP adds that much overhead although it would be
> interesting to measure that.
I use Jetty in a lot of places, so I'm curious about this Jetty + overhead
comment. Could you please share more info about what sort of overhead you are
referring to? (also, does it apply to all recent versions of Jetty - 6, 7, and
8?)
Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/
Re: Why Hadoop uses HTTP for file transmission between Map and Reduce?
Posted by Owen O'Malley <om...@apache.org>.
At some point, we'll replace Jetty in the shuffle, because it imposes too
much overhead, and go to Netty or some other lower-level library. I don't
think that using HTTP itself adds that much overhead, although it would be
interesting to measure that.
-- Owen
Re: Why Hadoop uses HTTP for file transmission between Map and Reduce?
Posted by He Chen <ai...@gmail.com>.
Actually, PhEDEx uses GridFTP for its data transfers.
On Thu, Jan 13, 2011 at 5:34 AM, Steve Loughran <st...@apache.org> wrote:
> On 13/01/11 08:34, li ping wrote:
>
>> That is also my concern. Is it efficient for data transmission?
>>
>
> It's long-lived TCP connections, reasonably efficient for bulk data transfer,
> has all the throttling of TCP built in, and comes with some excellently
> debugged client and server code in the form of Jetty and HttpClient. In
> maintenance costs alone, those libraries justify HTTP unless you have a
> vastly superior option *and are willing to maintain it forever*
>
> FTP's limits are well known (security), NFS's limits well known (security, UDP
> version doesn't throttle), self-developed protocols will have whatever
> problems you want.
>
> There are better protocols for long-haul data transfer over fat pipes, such
> as GridFTP and PhEDEx ( http://www.gridpp.ac.uk/papers/ah05_phedex.pdf ),
> which use multiple TCP channels in parallel to reduce the impact of a single
> lost packet, but within a datacentre, you shouldn't have to worry about
> this. If you do find lots of packets get lost, raise the issue with the
> networking team.
>
> -Steve
>
>
>
>> On Thu, Jan 13, 2011 at 4:27 PM, Nan Zhu<zh...@gmail.com> wrote:
>>
>> Hi, all
>>>
>>> I have a question about the file transmission between Map and Reduce
>>> stage,
>>> in current implementation, the Reducers get the results generated by
>>> Mappers
>>> through HTTP Get, I don't understand why HTTP is selected, why not FTP,
>>> or
>>> a
>>> self-developed protocol?
>>>
>>> Just because HTTP is simple?
>>>
>>> thanks
>>>
>>> Nan
>>>
>>>
>>
>>
>>
>
Re: Why Hadoop uses HTTP for file transmission between Map and Reduce?
Posted by Steve Loughran <st...@apache.org>.
On 13/01/11 08:34, li ping wrote:
> That is also my concern. Is it efficient for data transmission?
It's long-lived TCP connections, reasonably efficient for bulk data
transfer; it has all the throttling of TCP built in, and comes with some
excellently debugged client and server code in the form of Jetty and
HttpClient. In maintenance costs alone, those libraries justify HTTP
unless you have a vastly superior option *and are willing to maintain it
forever*.
FTP's limits are well known (security), NFS's limits are well known (security,
and the UDP version doesn't throttle), and self-developed protocols will have
whatever problems you want.
There are better protocols for long-haul data transfer over fat pipes,
such as GridFTP and PhEDEx
( http://www.gridpp.ac.uk/papers/ah05_phedex.pdf ), which use multiple TCP
channels in parallel to reduce the impact of a single lost packet, but
within a datacentre you shouldn't have to worry about this. If you do
find lots of packets get lost, raise the issue with the networking team.
-Steve
>
> On Thu, Jan 13, 2011 at 4:27 PM, Nan Zhu<zh...@gmail.com> wrote:
>
>> Hi, all
>>
>> I have a question about the file transmission between Map and Reduce stage,
>> in current implementation, the Reducers get the results generated by
>> Mappers
>> through HTTP Get, I don't understand why HTTP is selected, why not FTP, or
>> a
>> self-developed protocol?
>>
>> Just because HTTP is simple?
>>
>> thanks
>>
>> Nan
>>
>
>
>
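Steve's point that HTTP's server and client plumbing comes already debugged can be shown with a toy round trip. The sketch below uses the JDK's built-in com.sun.net.httpserver rather than Hadoop's actual Jetty-based servlet, so the endpoint name and the in-memory partition are assumptions for illustration: a local server hands out one map-output partition, and a plain HTTP GET fetches it back over a single TCP connection.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class MapOutputHttpSketch {
    // Serves the given bytes at a hypothetical "/mapOutput" endpoint on an
    // ephemeral port, fetches them back with HTTP GET, and returns the body.
    public static String serveAndFetch(byte[] partition) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/mapOutput", exchange -> {
            // The "tasktracker" side: stream the partition bytes back.
            exchange.sendResponseHeaders(200, partition.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(partition);
            }
        });
        server.start();
        int port = server.getAddress().getPort();
        try {
            // The "reducer" side: an ordinary HTTP GET, nothing protocol-specific.
            URL url = new URL("http://localhost:" + port + "/mapOutput");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            byte[] body = conn.getInputStream().readAllBytes();
            return new String(body, StandardCharsets.UTF_8);
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(
                serveAndFetch("key1\tval1\n".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Everything here, including connection handling, status codes, and error reporting, is supplied by standard libraries; a self-developed protocol would have to reimplement and debug all of it.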
Re: Why Hadoop uses HTTP for file transmission between Map and Reduce?
Posted by li ping <li...@gmail.com>.
That is also my concern. Is it efficient for data transmission?
On Thu, Jan 13, 2011 at 4:27 PM, Nan Zhu <zh...@gmail.com> wrote:
> Hi, all
>
> I have a question about the file transmission between Map and Reduce stage,
> in current implementation, the Reducers get the results generated by
> Mappers
> through HTTP Get, I don't understand why HTTP is selected, why not FTP, or
> a
> self-developed protocol?
>
> Just because HTTP is simple?
>
> thanks
>
> Nan
>
--
-----Li Ping