Posted to common-user@hadoop.apache.org by Wasim Bari <wa...@msn.com> on 2009/02/10 23:10:19 UTC

File Transfer Rates

Hi,
    Could someone help me find some real figures (transfer rates) for Hadoop file transfers from a local filesystem to HDFS, S3, etc., and between storage systems (HDFS to S3, etc.)?

Thanks,

Wasim 

Re: File Transfer Rates

Posted by Steve Loughran <st...@apache.org>.
Brian Bockelman wrote:
> Just to toss out some numbers.... (and because our users are making 
> interesting numbers right now)
> 
> Here's our external network router: 
> http://mrtg.unl.edu/~cricket/?target=%2Frouter-interfaces%2Fborder2%2Ftengigabitethernet2_2;view=Octets 
> 
> 
> Here's the application-level transfer graph: 
> http://t2.unl.edu/phedex/graphs/quantity_rates?link=src&no_mss=true&to_node=Nebraska 
> 
> 
> In a squeeze, we can move 20-50TB / day to/from other heterogenous 
> sites.  Usually, we run out of free space before we can find the upper 
> limit for a 24-hour period.
> 
> We use a protocol called GridFTP to move data back and forth between 
> external (non-HDFS) clusters.  The other sites we transfer with use 
> niche software you probably haven't heard of (Castor, DPM, and dCache) 
> because, well, it's niche software.  I have no available data on 
> HDFS<->S3 systems, but I'd again claim it's mostly a function of the 
> amount of hardware you throw at it and the size of your network pipes.
> 
> There are currently 182 datanodes; 180 are "traditional" ones of <3TB 
> and 2 are big honking RAID arrays of 40TB.  Transfers are load-balanced 
> amongst ~ 7 GridFTP servers which each have 1Gbps connection.
> 

GridFTP is optimised for high-bandwidth network connections, with 
negotiated window sizes and multiple parallel TCP connections, so when 
TCP congestion control backs off after a dropped packet, only one of 
the streams slows down rather than the whole transfer. It is probably 
best-in-class for long-haul transfers over the big university backbones 
where someone else pays for your traffic. You would be very hard 
pressed to get close to that on any other protocol.
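
Steve's point about multiple TCP connections can be sketched locally: split the file into byte ranges and move each range on its own worker, so congestion backoff on one stream only slows a fraction of the transfer. A rough Python illustration using threads over local files (a real GridFTP client would run each range over its own TCP socket to a remote server):

```python
import os
import threading

def copy_range(src, dst, offset, length):
    """Copy one byte range; in GridFTP each range rides its own TCP stream."""
    with open(src, "rb") as fin, open(dst, "r+b") as fout:
        fin.seek(offset)
        fout.seek(offset)
        fout.write(fin.read(length))

def parallel_copy(src, dst, streams=4):
    size = os.path.getsize(src)
    # Pre-allocate the destination so each worker can seek into its own range.
    with open(dst, "wb") as f:
        f.truncate(size)
    chunk = -(-size // streams)  # ceiling division
    workers = [
        threading.Thread(target=copy_range, args=(src, dst, i * chunk, chunk))
        for i in range(streams)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```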

I have no data on S3 transfers other than hearsay:
  * Write time to S3 can be slow, as the call doesn't return until the 
data is persisted "somewhere". That's a stronger guarantee than a POSIX 
write operation gives you.
  * You have to rely on the other tenants on your rack not wanting all 
the bandwidth for themselves. That's an EC2 limitation: you don't get 
to request/buy bandwidth to/from S3.
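
The difference in write guarantees is easy to see locally: a POSIX write() returns once the kernel has buffered the data, and only fsync() waits for stable storage, which is roughly the point at which an S3 PUT would first acknowledge. A small timing sketch (numbers are machine-dependent):

```python
import os
import tempfile
import time

def timed_write(path, data, sync):
    """Time one write; sync=True waits for the data to reach stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    start = time.perf_counter()
    os.write(fd, data)
    if sync:
        os.fsync(fd)  # roughly where an S3 PUT would first acknowledge
    elapsed = time.perf_counter() - start
    os.close(fd)
    return elapsed

data = os.urandom(8 * 1024 * 1024)
tmp = tempfile.gettempdir()
buffered = timed_write(os.path.join(tmp, "buffered.bin"), data, sync=False)
durable = timed_write(os.path.join(tmp, "durable.bin"), data, sync=True)
print(f"buffered: {buffered:.4f}s  durable: {durable:.4f}s")
```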

One thing to remember is that if you bring up a Hadoop cluster on any 
virtual server farm, disk I/O is going to be well below physical I/O 
rates. Even when the data is in HDFS, it will be slower to access than 
dedicated high-RPM SCSI or SATA storage.

Re: File Transfer Rates

Posted by Mark Kerzner <ma...@gmail.com>.
I say, that's very interesting and useful.

On Tue, Feb 10, 2009 at 11:37 PM, Brian Bockelman <bb...@cse.unl.edu>wrote:

> Just to toss out some numbers.... (and because our users are making
> interesting numbers right now)
>
> Here's our external network router:
> http://mrtg.unl.edu/~cricket/?target=%2Frouter-interfaces%2Fborder2%2Ftengigabitethernet2_2;view=Octets
>
> Here's the application-level transfer graph:
> http://t2.unl.edu/phedex/graphs/quantity_rates?link=src&no_mss=true&to_node=Nebraska
>
> In a squeeze, we can move 20-50TB / day to/from other heterogenous sites.
>  Usually, we run out of free space before we can find the upper limit for a
> 24-hour period.
>
> We use a protocol called GridFTP to move data back and forth between
> external (non-HDFS) clusters.  The other sites we transfer with use niche
> software you probably haven't heard of (Castor, DPM, and dCache) because,
> well, it's niche software.  I have no available data on HDFS<->S3 systems,
> but I'd again claim it's mostly a function of the amount of hardware you
> throw at it and the size of your network pipes.
>
> There are currently 182 datanodes; 180 are "traditional" ones of <3TB and 2
> are big honking RAID arrays of 40TB.  Transfers are load-balanced amongst ~
> 7 GridFTP servers which each have 1Gbps connection.
>
> Does that help?
>
> Brian
>

Re: File Transfer Rates

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Just to toss out some numbers.... (and because our users are making  
interesting numbers right now)

Here's our external network router: http://mrtg.unl.edu/~cricket/?target=%2Frouter-interfaces%2Fborder2%2Ftengigabitethernet2_2;view=Octets

Here's the application-level transfer graph: http://t2.unl.edu/phedex/graphs/quantity_rates?link=src&no_mss=true&to_node=Nebraska

In a squeeze, we can move 20-50TB / day to/from other heterogeneous  
sites.  Usually, we run out of free space before we can find the upper  
limit for a 24-hour period.

We use a protocol called GridFTP to move data back and forth between  
external (non-HDFS) clusters.  The other sites we transfer with use  
niche software you probably haven't heard of (Castor, DPM, and dCache)  
because, well, it's niche software.  I have no available data on 
HDFS<->S3 systems, but I'd again claim it's mostly a function of the 
amount of hardware you throw at it and the size of your network pipes.

There are currently 182 datanodes; 180 are "traditional" ones of <3TB 
and 2 are big honking RAID arrays of 40TB.  Transfers are 
load-balanced amongst ~ 7 GridFTP servers which each have 1Gbps 
connection.

Does that help?

Brian

On Feb 10, 2009, at 4:46 PM, Brian Bockelman wrote:

>
> On Feb 10, 2009, at 4:10 PM, Wasim Bari wrote:
>
>> Hi,
>>   Could someone help me to find some real Figures (transfer rate)  
>> about Hadoop File transfer  from local filesystem to HDFS, S3 etc  
>> and among Storage Systems (HDFS to S3 etc)
>>
>> Thanks,
>>
>> Wasim
>
> What are you looking for?  Maximum possible transfer rate?  Maximum  
> possible transfer rate per client?  Generally, if you're using the  
> Java client, transfer rate to/from HDFS is limited by the hardware  
> you have and the network connection (if you have 1Gbps per client).
>
> I could give you a graph showing a peak of 9Gbps from our Hadoop  
> instance to the WAN, but that's not very interesting if you don't  
> have a 10Gbps pipe...
>
> Brian


Re: File Transfer Rates

Posted by Mark Kerzner <ma...@gmail.com>.
Brian, I saw that Stuart here 
<http://stuartsierra.com/2008/04/24/a-million-little-files> mentions 
slow writes to SequenceFile. If so, I will either use his tar approach 
or try to parallelize it if I can.
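
For reference, the tar approach amounts to packing many small files into one archive so the filesystem sees a single large write instead of thousands of tiny ones. A minimal sketch with Python's tarfile module (paths are illustrative); the archive would then go up with a single hadoop fs -put:

```python
import tarfile
from pathlib import Path

def pack_small_files(src_dir, archive_path):
    """Bundle every file under src_dir into one tar: one big sequential write."""
    with tarfile.open(archive_path, "w") as tar:
        for path in sorted(Path(src_dir).rglob("*")):
            if path.is_file():
                # arcname keeps paths relative so the archive is relocatable
                tar.add(path, arcname=str(path.relative_to(src_dir)))
    return archive_path
```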

On Tue, Feb 10, 2009 at 11:14 PM, Brian Bockelman <bb...@cse.unl.edu>wrote:

>
> On Feb 10, 2009, at 11:09 PM, Mark Kerzner wrote:
>
>  Brian, large files using command-line hadoop go fast, so it is something
>> about my computer or network. I won't worry about this now, especially in
>> light of Amit reporting fast writes and reads.
>>
>
> You're creating files using SequenceFile, right?  It might be that the
> creation of the sequence file is the portion which is slow, not the network
> I/O.
>
> I don't have much knowledge about optimization of SequenceFile creation.  I
> assume that you'll want to start by tweaking compression on and off.
>  Additionally, Jeff (I think) pointed to a Hadoop Archive file, which also
> might be an alternative for your system.  I don't know enough to give you a
> set of pros and cons, just enough to mention it as an alternative to
> experiment with.
>
> Sorry I'm not useful here...
>
> Brian
>

Re: File Transfer Rates

Posted by Brian Bockelman <bb...@cse.unl.edu>.
On Feb 10, 2009, at 11:09 PM, Mark Kerzner wrote:

> Brian, large files using command-line hadoop go fast, so it is  
> something
> about my computer or network. I won't worry about this now,  
> especially in
> light of Amit reporting fast writes and reads.

You're creating files using SequenceFile, right?  It might be that the  
creation of the sequence file is the portion which is slow, not the  
network I/O.

I don't have much knowledge about optimization of SequenceFile  
creation.  I assume that you'll want to start by tweaking compression  
on and off.  Additionally, Jeff (I think) pointed to a Hadoop Archive  
file, which also might be an alternative for your system.  I don't  
know enough to give you a set of pros and cons, just enough to mention  
it as an alternative to experiment with.

Sorry I'm not useful here...

Brian



Re: File Transfer Rates

Posted by Mark Kerzner <ma...@gmail.com>.
Brian, large files using command-line hadoop go fast, so it is something
about my computer or network. I won't worry about this now, especially in
light of Amit reporting fast writes and reads.

Mark

On Tue, Feb 10, 2009 at 5:00 PM, Brian Bockelman <bb...@cse.unl.edu>wrote:

>
> On Feb 10, 2009, at 4:53 PM, Mark Kerzner wrote:
>
>  Brian, I have a similar question: why does transfer from a local
>> filesystem
>> to SequenceFile takes so long (about 1 second per Meg)?
>>
>
> Hey Mark,
>
> I saw your question about speed the other day ... unfortunately, I didn't
> have any specific advice so I stayed quiet :)
>
> In a correctly configured cluster, performance is mostly limited by
> available hardware.  If it's obvious that performance is well below hardware
> limits (such as in your case), it's usually (a) you're not generating files
> fast enough or (b) something is configured wrong.
>
> Have you just tried hadoop fs -put .... for some large file hanging around
> locally?  If that doesn't go more than 5MB/s or so (when your hardware can
> obviously do such a rate), then there's probably a configuration issue.
>
> Brian
>

Re: File Transfer Rates

Posted by Brian Bockelman <bb...@cse.unl.edu>.
On Feb 10, 2009, at 4:53 PM, Mark Kerzner wrote:

> Brian, I have a similar question: why does transfer from a local  
> filesystem
> to SequenceFile takes so long (about 1 second per Meg)?

Hey Mark,

I saw your question about speed the other day ... unfortunately, I  
didn't have any specific advice so I stayed quiet :)

In a correctly configured cluster, performance is mostly limited by  
available hardware.  If it's obvious that performance is well below  
hardware limits (such as in your case), it's usually (a) you're not  
generating files fast enough or (b) something is configured wrong.

Have you just tried hadoop fs -put .... for some large file hanging  
around locally?  If that doesn't go more than 5MB/s or so (when your  
hardware can obviously do such a rate), then there's probably a  
configuration issue.

Brian
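
Turning such a test into a number is just bytes over seconds; a small helper, using a local copy as a stand-in for the hadoop fs -put being timed:

```python
import os
import shutil
import time

def measure_throughput(src, dst):
    """Copy src to dst and return the observed rate in MB/s."""
    start = time.perf_counter()
    shutil.copyfile(src, dst)  # stand-in for: hadoop fs -put <src> <dst>
    elapsed = time.perf_counter() - start
    size_mb = os.path.getsize(src) / (1024 * 1024)
    return size_mb / elapsed if elapsed > 0 else float("inf")
```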



Re: File Transfer Rates

Posted by Amit Chandel <am...@gmail.com>.
With my setup, I have been able to get 10 MB/s write speed and 40 MB/s 
read speed while writing multiple files (ranging from a few bytes to 
100MB) into SequenceFiles, and reading them back. The cluster has a 
1Gbps backbone.

On Tue, Feb 10, 2009 at 5:53 PM, Mark Kerzner <ma...@gmail.com> wrote:

> Brian, I have a similar question: why does transfer from a local filesystem
> to SequenceFile takes so long (about 1 second per Meg)?
> Thank you,
> Mark

Re: File Transfer Rates

Posted by Mark Kerzner <ma...@gmail.com>.
Brian, I have a similar question: why does transfer from a local filesystem
to SequenceFile take so long (about 1 second per megabyte)?
Thank you,
Mark

On Tue, Feb 10, 2009 at 4:46 PM, Brian Bockelman <bb...@cse.unl.edu>wrote:

>
> On Feb 10, 2009, at 4:10 PM, Wasim Bari wrote:
>
>  Hi,
>>   Could someone help me to find some real Figures (transfer rate) about
>> Hadoop File transfer  from local filesystem to HDFS, S3 etc and among
>> Storage Systems (HDFS to S3 etc)
>>
>> Thanks,
>>
>> Wasim
>>
>
> What are you looking for?  Maximum possible transfer rate?  Maximum
> possible transfer rate per client?  Generally, if you're using the Java
> client, transfer rate to/from HDFS is limited by the hardware you have and
> the network connection (if you have 1Gbps per client).
>
> I could give you a graph showing a peak of 9Gbps from our Hadoop instance
> to the WAN, but that's not very interesting if you don't have a 10Gbps
> pipe...
>
> Brian
>
>

Re: File Transfer Rates

Posted by Brian Bockelman <bb...@cse.unl.edu>.
On Feb 10, 2009, at 4:10 PM, Wasim Bari wrote:

> Hi,
>    Could someone help me to find some real Figures (transfer rate)  
> about Hadoop File transfer  from local filesystem to HDFS, S3 etc  
> and among Storage Systems (HDFS to S3 etc)
>
> Thanks,
>
> Wasim

What are you looking for?  Maximum possible transfer rate?  Maximum  
possible transfer rate per client?  Generally, if you're using the  
Java client, transfer rate to/from HDFS is limited by the hardware you  
have and the network connection (if you have 1Gbps per client).
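
As a back-of-the-envelope check on that 1Gbps figure (the 0.8 efficiency factor is a rough assumption for protocol and framing overhead):

```python
def link_limit_mb_per_s(gbps, efficiency=0.8):
    """Rough per-client ceiling: line rate minus protocol/framing overhead."""
    return gbps * 1000 / 8 * efficiency

# A 1 Gbps client link tops out around 100 MB/s in practice, so a
# 9 Gbps aggregate needs many concurrent clients or a fatter pipe.
print(f"{link_limit_mb_per_s(1):.0f} MB/s")  # -> 100 MB/s
```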

I could give you a graph showing a peak of 9Gbps from our Hadoop  
instance to the WAN, but that's not very interesting if you don't have  
a 10Gbps pipe...

Brian