Posted to common-user@hadoop.apache.org by Martin Mituzas <xi...@hotmail.com> on 2009/11/24 08:23:22 UTC

why does not hdfs read ahead ?

I read the code and found that the call
DFSInputStream.read(buf, off, len)
will cause the DataNode to read len bytes (or less if it encounters the end
of a block). Why does HDFS not read ahead to improve performance for
sequential reads?
-- 
View this message in context: http://old.nabble.com/why-does-not-hdfs-read-ahead---tp26491449p26491449.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
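For context: DFSInputStream extends java.io.InputStream, so read(buf, off, len) follows the usual short-read contract, i.e. a single call may return fewer than len bytes (for DFSInputStream, typically at a block boundary), and a caller that wants exactly len bytes must loop. A minimal sketch of that loop follows; the class and method names are illustrative, not Hadoop API, and a ByteArrayInputStream stands in for the DFS stream:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadLoopDemo {
    // Read exactly len bytes (or until EOF), looping because a single
    // read() call may legally return fewer bytes than requested.
    static int readFully(InputStream in, byte[] buf, int off, int len)
            throws IOException {
        int total = 0;
        while (total < len) {
            int n = in.read(buf, off + total, len - total);
            if (n < 0) break; // EOF
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[1000]);
        byte[] buf = new byte[1000];
        int got = readFully(in, buf, 0, buf.length);
        System.out.println("read " + got + " bytes");
    }
}
```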


Re: why does not hdfs read ahead ?

Posted by Todd Lipcon <to...@cloudera.com>.
On Tue, Nov 24, 2009 at 10:35 AM, Raghu Angadi <an...@gmail.com> wrote:

> Sequential read is the simplest case and it is pretty hard to improve upon
> the current raw performance (HDFS client does take more CPU than one might
> expect, Todd implemented an improvement for CPU consumed).
>
>
> Just to reiterate what Todd said, there is an implicit read ahead for
> sequential reads with TCP buffers and kernel read ahead on Datanodes.
>
>
The one thing explicit readahead may buy us is dealing with the fact that
Linux's readahead implementation does very poorly at detecting sequential
access when you have multiple parallel sequential readers on the same
block device. This is often the case with Hadoop, and the default schedulers
do a pretty bad job of it. Explicitly doing your own readahead allows the
scheduler to do a better job of avoiding seeks, and you can overlap CPU and
IO much better. I think this would benefit the various Mergers in
particular.

-Todd


> If you extend the read ahead buffer to be more of a buffer cache for the
> block, it could have big impact for some read access patterns (e.g. binary
> search).
>
> Raghu.
>
> On Mon, Nov 23, 2009 at 11:23 PM, Martin Mituzas <xietao1981@hotmail.com
> >wrote:
>
> >
> > I read the code and found that the call
> > DFSInputStream.read(buf, off, len)
> > will cause the DataNode to read len bytes (or less if it encounters the
> > end of a block). Why does HDFS not read ahead to improve performance for
> > sequential reads?
> > --
> > View this message in context:
> >
> http://old.nabble.com/why-does-not-hdfs-read-ahead---tp26491449p26491449.html
> > Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >
> >
>
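The application-level readahead Todd describes can be sketched as simple double buffering: a background thread fetches the next chunk while the caller consumes the current one, overlapping CPU with I/O. This is an illustrative sketch, not HDFS code; ReadaheadStream and its methods are invented names, and a DFSInputStream could stand in for the InputStream here since it extends java.io.InputStream:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Double-buffered readahead: while the caller consumes one chunk, a
// background thread fetches the next one from the underlying stream.
public class ReadaheadStream {
    private final InputStream in;      // e.g. a DFSInputStream
    private final int chunkSize;
    private final ExecutorService pool = Executors.newSingleThreadExecutor();
    private Future<byte[]> next;

    ReadaheadStream(InputStream in, int chunkSize) {
        this.in = in;
        this.chunkSize = chunkSize;
        this.next = pool.submit(this::fetch);   // prime the pipeline
    }

    // Fills one chunk; returns null at EOF, or a trimmed final chunk.
    private byte[] fetch() throws IOException {
        byte[] buf = new byte[chunkSize];
        int total = 0;
        while (total < chunkSize) {
            int n = in.read(buf, total, chunkSize - total);
            if (n < 0) break;
            total += n;
        }
        if (total == 0) return null;
        if (total < chunkSize) {
            byte[] t = new byte[total];
            System.arraycopy(buf, 0, t, 0, total);
            return t;
        }
        return buf;
    }

    /** Returns the next chunk (null at EOF) and schedules the one after. */
    byte[] nextChunk() throws Exception {
        byte[] current = next.get();            // wait for the prefetch
        if (current != null) next = pool.submit(this::fetch);
        return current;
    }

    void close() { pool.shutdown(); }

    public static void main(String[] args) throws Exception {
        ReadaheadStream r = new ReadaheadStream(
                new ByteArrayInputStream(new byte[2500]), 1024);
        int total = 0;
        for (byte[] c; (c = r.nextChunk()) != null; ) total += c.length;
        r.close();
        System.out.println("consumed " + total + " bytes");
    }
}
```

A real implementation would want bounded queues and error propagation; the point is only that the consumer never waits for disk or network if the prefetch keeps up.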

Re: why does not hdfs read ahead ?

Posted by Raghu Angadi <an...@gmail.com>.
I am certainly interested in where this experiment leads. I am sure many on
the list would be interested too.

Using the native Java API would certainly simplify things (though it is not
required).

To find the bottleneck, I would look in the obvious places first:
 1. CPU on the client
 2. network (netstat on one of the datanodes and on the client would be good)
 3. disk I/O on the datanodes (iostat -x)

Is the experimental setup described in more detail somewhere?

With very high b/w networks, TCP buffer sizes could be a factor even with
LAN latencies.

A jira would also be a good place to discuss details.

Raghu.

On Tue, Nov 24, 2009 at 3:35 PM, Michael Thomas <th...@hep.caltech.edu>wrote:

> Hey guys,
>
> During the SC09 exercise, our data transfer tool was using the FUSE
> interface to HDFS.  As Brian said, we were also reading 16 files in
> parallel.  This seemed to be the optimal number, beyond which the aggregate
> read rate did not improve.
>
> We have work scheduled to modify our data transfer tool to use the native
> hadoop java APIs, as well as running some additional tests offline to see if
> the HDFS-FUSE interface is the bottleneck as we suspect.
>
> Regards,
>
> --Mike
>
>
> On 11/24/2009 03:01 PM, Brian Bockelman wrote:
>
>> Hey Raghu,
>>
>> There are a few performance issues.  Last week during Supercomputing '09,
>> Caltech was having issues with getting more than 2.6 Gbps per HDFS client
>> process (I think they were pulling 16 files per process, but Mike knows the
>> details).  I think they'd appreciate any advice you have about tuning HDFS
>> performance.
>>
>> We're starting early R&D for 100Gbps dataflows, and I believe improving
>> our current HDFS performance is on the TODO list.
>>
>> Brian
>>
>> (PS - I'm not saying HDFS is at fault here - it always remains a
>> possibility that we're using it in a sub-optimal manner.  If you have any
>> favorite Java performance instrumentation to recommend, we'd also be
>> interested in that.)
>>
>> On Nov 24, 2009, at 12:35 PM, Raghu Angadi wrote:
>>
>>  Sequential read is the simplest case and it is pretty hard to improve
>>> upon
>>> the current raw performance (HDFS client does take more CPU than one
>>> might
>>> expect, Todd implemented an improvement for CPU consumed).
>>>
>>> Just to reiterate what Todd said, there is an implicit read ahead for
>>> sequential reads with TCP buffers and kernel read ahead on Datanodes.
>>>
>>> If you extend the read ahead buffer to be more of a buffer cache for the
>>> block, it could have big impact for some read access patterns (e.g.
>>> binary
>>> search).
>>>
>>> Raghu.
>>>
>>> On Mon, Nov 23, 2009 at 11:23 PM, Martin Mituzas<xietao1981@hotmail.com
>>> >wrote:
>>>
>>>
>>>> I read the code and found that the call
>>>> DFSInputStream.read(buf, off, len)
>>>> will cause the DataNode to read len bytes (or less if it encounters the
>>>> end of a block). Why does HDFS not read ahead to improve performance for
>>>> sequential reads?
>>>> --
>>>> View this message in context:
>>>>
>>>> http://old.nabble.com/why-does-not-hdfs-read-ahead---tp26491449p26491449.html
>>>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>>>
>>>>
>>>>
>>
>
>

Re: why does not hdfs read ahead ?

Posted by Steve Loughran <st...@apache.org>.
Michael Thomas wrote:
> Hey guys,
> 
> During the SC09 exercise, our data transfer tool was using the FUSE 
> interface to HDFS.  As Brian said, we were also reading 16 files in 
> parallel.  This seemed to be the optimal number, beyond which the 
> aggregate read rate did not improve.
> 
> We have work scheduled to modify our data transfer tool to use the 
> native hadoop java APIs, as well as running some additional tests 
> offline to see if the HDFS-FUSE interface is the bottleneck as we suspect.
> 
> Regards,
> 
> --Mike

Was this all local data?

In Russ Perry's little paper "High Speed Raster Image Streaming For 
Digital Presses Using the Hadoop File System", he got 4Gb/s over the LAN 
by having a client app decide which datanode to pull each block from, 
rather than having the NN tell it which node to ask for which block.

"Measured stream rates approaching 4Gb/s were achieved which is close to 
the required rate for streaming pages containing rich designs to a 
digital press. This required only a minor extension to the Hadoop client 
to allow file blocks to be read in parallel from the Hadoop data nodes."

http://www.hpl.hp.com/techreports/2009/HPL-2009-345.html
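The extension Steve describes, reading a file's blocks from several datanodes in parallel and reassembling them on the client, can be sketched roughly as below. This is not Perry's actual code: BlockFetcher is an invented interface standing in for a per-datanode block read, and a fixed block size is assumed for simplicity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of parallel block reads: fetch each block of a file from a
// (client-chosen) datanode concurrently, then reassemble in order.
public class ParallelBlockReader {
    interface BlockFetcher {                // stands in for a datanode read
        byte[] fetch(int blockIndex) throws Exception;
    }

    static byte[] readAllBlocks(BlockFetcher fetcher, int numBlocks,
                                int blockSize, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<byte[]>> futures = new ArrayList<>();
        for (int i = 0; i < numBlocks; i++) {
            final int idx = i;
            futures.add(pool.submit(() -> fetcher.fetch(idx)));
        }
        byte[] out = new byte[numBlocks * blockSize];
        for (int i = 0; i < numBlocks; i++) {   // reassemble in file order
            System.arraycopy(futures.get(i).get(), 0, out,
                             i * blockSize, blockSize);
        }
        pool.shutdown();
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Fake "datanode": block i is filled with the byte value i.
        BlockFetcher fake = i -> {
            byte[] b = new byte[4];
            java.util.Arrays.fill(b, (byte) i);
            return b;
        };
        byte[] file = readAllBlocks(fake, 3, 4, 3);
        System.out.println(file.length + " bytes, last=" + file[11]);
    }
}
```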


Re: why does not hdfs read ahead ?

Posted by Michael Thomas <th...@hep.caltech.edu>.
Hey guys,

During the SC09 exercise, our data transfer tool was using the FUSE 
interface to HDFS.  As Brian said, we were also reading 16 files in 
parallel.  This seemed to be the optimal number, beyond which the 
aggregate read rate did not improve.

We have work scheduled to modify our data transfer tool to use the 
native hadoop java APIs, as well as running some additional tests 
offline to see if the HDFS-FUSE interface is the bottleneck as we suspect.

Regards,

--Mike

On 11/24/2009 03:01 PM, Brian Bockelman wrote:
> Hey Raghu,
>
> There are a few performance issues.  Last week during Supercomputing '09, Caltech was having issues with getting more than 2.6 Gbps per HDFS client process (I think they were pulling 16 files per process, but Mike knows the details).  I think they'd appreciate any advice you have about tuning HDFS performance.
>
> We're starting early R&D for 100Gbps dataflows, and I believe improving our current HDFS performance is on the TODO list.
>
> Brian
>
> (PS - I'm not saying HDFS is at fault here - it always remains a possibility that we're using it in a sub-optimal manner.  If you have any favorite Java performance instrumentation to recommend, we'd also be interested in that.)
>
> On Nov 24, 2009, at 12:35 PM, Raghu Angadi wrote:
>
>> Sequential read is the simplest case and it is pretty hard to improve upon
>> the current raw performance (HDFS client does take more CPU than one might
>> expect, Todd implemented an improvement for CPU consumed).
>>
>> Just to reiterate what Todd said, there is an implicit read ahead for
>> sequential reads with TCP buffers and kernel read ahead on Datanodes.
>>
>> If you extend the read ahead buffer to be more of a buffer cache for the
>> block, it could have big impact for some read access patterns (e.g. binary
>> search).
>>
>> Raghu.
>>
>> On Mon, Nov 23, 2009 at 11:23 PM, Martin Mituzas<xi...@hotmail.com>wrote:
>>
>>>
>>> I read the code and found that the call
>>> DFSInputStream.read(buf, off, len)
>>> will cause the DataNode to read len bytes (or less if it encounters the
>>> end of a block). Why does HDFS not read ahead to improve performance for
>>> sequential reads?
>>> --
>>> View this message in context:
>>> http://old.nabble.com/why-does-not-hdfs-read-ahead---tp26491449p26491449.html
>>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>>
>>>
>



Re: why does not hdfs read ahead ?

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hey Raghu,

There are a few performance issues.  Last week during Supercomputing '09, Caltech was having issues with getting more than 2.6 Gbps per HDFS client process (I think they were pulling 16 files per process, but Mike knows the details).  I think they'd appreciate any advice you have about tuning HDFS performance.

We're starting early R&D for 100Gbps dataflows, and I believe improving our current HDFS performance is on the TODO list.

Brian

(PS - I'm not saying HDFS is at fault here - it always remains a possibility that we're using it in a sub-optimal manner.  If you have any favorite Java performance instrumentation to recommend, we'd also be interested in that.)

On Nov 24, 2009, at 12:35 PM, Raghu Angadi wrote:

> Sequential read is the simplest case and it is pretty hard to improve upon
> the current raw performance (HDFS client does take more CPU than one might
> expect, Todd implemented an improvement for CPU consumed).
> 
> Just to reiterate what Todd said, there is an implicit read ahead for
> sequential reads with TCP buffers and kernel read ahead on Datanodes.
> 
> If you extend the read ahead buffer to be more of a buffer cache for the
> block, it could have big impact for some read access patterns (e.g. binary
> search).
> 
> Raghu.
> 
> On Mon, Nov 23, 2009 at 11:23 PM, Martin Mituzas <xi...@hotmail.com>wrote:
> 
>> 
>> I read the code and found that the call
>> DFSInputStream.read(buf, off, len)
>> will cause the DataNode to read len bytes (or less if it encounters the
>> end of a block). Why does HDFS not read ahead to improve performance for
>> sequential reads?
>> --
>> View this message in context:
>> http://old.nabble.com/why-does-not-hdfs-read-ahead---tp26491449p26491449.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>> 
>> 


Re: why does not hdfs read ahead ?

Posted by Raghu Angadi <an...@gmail.com>.
Sequential read is the simplest case, and it is pretty hard to improve upon
the current raw performance (the HDFS client does take more CPU than one
might expect; Todd implemented an improvement for the CPU consumed).

Just to reiterate what Todd said, there is an implicit read ahead for
sequential reads with TCP buffers and kernel read ahead on Datanodes.

If you extend the read ahead buffer to be more of a buffer cache for the
block, it could have a big impact for some read access patterns (e.g. binary
search).

Raghu.

On Mon, Nov 23, 2009 at 11:23 PM, Martin Mituzas <xi...@hotmail.com>wrote:

>
> I read the code and found that the call
> DFSInputStream.read(buf, off, len)
> will cause the DataNode to read len bytes (or less if it encounters the
> end of a block). Why does HDFS not read ahead to improve performance for
> sequential reads?
> --
> View this message in context:
> http://old.nabble.com/why-does-not-hdfs-read-ahead---tp26491449p26491449.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>
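Raghu's buffer-cache suggestion can be sketched as a client-side cache keyed by block index: a seek-heavy pattern such as binary search then touches the datanode once per block instead of once per read. This is purely illustrative; BlockSource and the class name are invented, and a real cache would need bounds and eviction.

```java
import java.util.HashMap;
import java.util.Map;

// Cache whole blocks on the client so repeated random reads within the
// same block are served from memory rather than from the datanode.
public class BlockCachingReader {
    interface BlockSource { byte[] readBlock(long blockIndex); }

    private final BlockSource source;   // stands in for datanode reads
    private final long blockSize;
    private final Map<Long, byte[]> cache = new HashMap<>();
    long blockFetches = 0;              // instrumentation

    BlockCachingReader(BlockSource source, long blockSize) {
        this.source = source;
        this.blockSize = blockSize;
    }

    byte readByteAt(long pos) {
        long blockIndex = pos / blockSize;
        byte[] block = cache.computeIfAbsent(blockIndex, i -> {
            blockFetches++;             // only on a cache miss
            return source.readBlock(i);
        });
        return block[(int) (pos % blockSize)];
    }

    public static void main(String[] args) {
        BlockCachingReader r = new BlockCachingReader(
                i -> new byte[1024], 1024);    // fake all-zero blocks
        // Binary-search-like probes clustered in the first two blocks:
        long[] probes = {1000, 500, 250, 375, 1500, 1250};
        for (long p : probes) r.readByteAt(p);
        System.out.println("fetches=" + r.blockFetches);
    }
}
```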

Re: why does not hdfs read ahead ?

Posted by Brian Bockelman <bb...@cse.unl.edu>.
On Nov 24, 2009, at 12:36 PM, Todd Lipcon wrote:

> On Tue, Nov 24, 2009 at 10:33 AM, Brian Bockelman <bb...@cse.unl.edu>wrote:
> 
>> 
>> On Nov 24, 2009, at 12:06 PM, Todd Lipcon wrote:
>> 
>>> Also, keep in mind that, when you open a block for reading, the DN
>>> immediately starts writing the entire block (assuming it's requested via
>> the
>>> xceiver protocol) - it's TCP backpressure on the send window that does
>> flow
>>> control there.
>> 
>> Ok, that's a pretty freakin' cool idea.  Is it well-documented how this
>> technique works?  How does this affect folks (me) who use the pread
>> interface?
>> 
> 
> AFAIK using pread sends the actual length with the OP_READ_BLOCK command, so
> it doesn't read ahead past what you ask for. The awful thing about pread is
> that it actually makes a new datanode connection for every read - including
> the TCP handshake round trip, thread setup/teardown, etc.
> 

I'm not going to argue with the fact that we can do better here, but it's not as bad as you think for our particular workflow.  Our random reads are "truly random"; i.e., there are approximately zero repeated requests of data.  Hence, the 1ms of overhead is pretty negligible compared to spinning a hard drive (10ms when the cluster is idle, 30ms when we're pounding it).

In future versions of our software, we've made things at least "monotonically increasing".  I.e., with a few exceptions, every position is strictly greater than the position of the last read.  (It doesn't mean we can sequentially read out the file; our reads can be quite sparse, only taking 10% of the file; if we read things sequentially, we'd overread by a factor of 10, and that can start to hit network limitations).

At some point, I need to do a talk or write-up of the column-oriented techniques that HEP folks do; after all, they've been doing column-oriented stores for the past 20 years or so.  They have some tricks up their sleeves, and it would be interesting to compare notes.

Brian

> 
>> 
>>> So, although it's not explicitly reading ahead, most of the
>>> reads on DFSInputStream should be coming from the TCP receive buffer, not
>>> making round trips.
>>> 
>>> At one point a few weeks ago I did hack explicit readahead around
>>> DFSInputStream and didn't see an appreciable difference. I didn't spend
>> much
>>> time on it, though, so I may have screwed something up - wasn't a
>> scientific
>>> test.
>>> 
>> 
>> Speaking from someone who's worked with storage systems that do an explicit
>> readahead, this can turn out to be a big giant disaster if it's combined
>> with random reads.
>> 
>> Big disaster as far as application-level throughput goes - but does make
>> for impressive ganglia graphs!
>> 
>> Brian
>> 
>>> -Todd
>>> 
>>> On Tue, Nov 24, 2009 at 10:02 AM, Eli Collins <el...@cloudera.com> wrote:
>>> 
>>>> Hey Martin,
>>>> 
>>>> It would be an interesting experiment but I'm not sure it would
>>>> improve things as the host (and hardware to some extent) are already
>>>> reading ahead. A useful exercise would be to evaluate whether the new
>>>> default host parameters for on-demand readahead are suitable for
>>>> hadoop.
>>>> 
>>>> http://lwn.net/Articles/235164
>>>> http://lwn.net/Articles/235181
>>>> 
>>>> Thanks,
>>>> Eli
>>>> 
>>>> On Mon, Nov 23, 2009 at 11:23 PM, Martin Mituzas <
>> xietao1981@hotmail.com>
>>>> wrote:
>>>>> 
>>>>> I read the code and found that the call
>>>>> DFSInputStream.read(buf, off, len)
>>>>> will cause the DataNode to read len bytes (or less if it encounters the
>>>>> end of a block). Why does HDFS not read ahead to improve performance for
>>>>> sequential reads?
>>>>> --
>>>>> View this message in context:
>>>> 
>> http://old.nabble.com/why-does-not-hdfs-read-ahead---tp26491449p26491449.html
>>>>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>>>> 
>>>>> 
>>>> 
>> 
>> 
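The pread interface Brian mentions (Hadoop's PositionedReadable) reads at an absolute offset without moving the stream position. The toy stand-in below mimics that signature over an in-memory buffer to illustrate the sparse, forward-only access pattern he describes, reading roughly 10% of a file; it is not Hadoop code, and the names are invented.

```java
import java.util.Arrays;

// Positional-read sketch: like Hadoop's
//   int read(long position, byte[] buffer, int offset, int length)
// each call addresses an absolute offset and carries no stream state.
public class SparsePreadDemo {
    private final byte[] data;          // stands in for a block on disk

    SparsePreadDemo(byte[] data) { this.data = data; }

    int pread(long position, byte[] buffer, int offset, int length) {
        int avail = (int) Math.min(length, data.length - position);
        if (avail <= 0) return -1;      // past EOF
        System.arraycopy(data, (int) position, buffer, offset, avail);
        return avail;
    }

    public static void main(String[] args) {
        byte[] file = new byte[100_000];
        Arrays.fill(file, (byte) 7);
        SparsePreadDemo in = new SparsePreadDemo(file);
        byte[] buf = new byte[1000];
        long bytesRead = 0;
        // Forward-only sparse reads: 1 KB out of every 10 KB.
        for (long pos = 0; pos < file.length; pos += 10_000) {
            bytesRead += in.pread(pos, buf, 0, buf.length);
        }
        System.out.println("read " + bytesRead + " of " + file.length);
    }
}
```

With the real client, each such pread also pays the per-call datanode connection cost Todd describes, which is why the monotonic pattern matters: it keeps the option open of batching or reusing connections.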


Re: why does not hdfs read ahead ?

Posted by Todd Lipcon <to...@cloudera.com>.
On Tue, Nov 24, 2009 at 10:33 AM, Brian Bockelman <bb...@cse.unl.edu>wrote:

>
> On Nov 24, 2009, at 12:06 PM, Todd Lipcon wrote:
>
> > Also, keep in mind that, when you open a block for reading, the DN
> > immediately starts writing the entire block (assuming it's requested via
> the
> > xceiver protocol) - it's TCP backpressure on the send window that does
> flow
> > control there.
>
> Ok, that's a pretty freakin' cool idea.  Is it well-documented how this
> technique works?  How does this affect folks (me) who use the pread
> interface?
>

AFAIK using pread sends the actual length with the OP_READ_BLOCK command, so
it doesn't read ahead past what you ask for. The awful thing about pread is
that it actually makes a new datanode connection for every read - including
the TCP handshake round trip, thread setup/teardown, etc.


>
> > So, although it's not explicitly reading ahead, most of the
> > reads on DFSInputStream should be coming from the TCP receive buffer, not
> > making round trips.
> >
> > At one point a few weeks ago I did hack explicit readahead around
> > DFSInputStream and didn't see an appreciable difference. I didn't spend
> much
> > time on it, though, so I may have screwed something up - wasn't a
> scientific
> > test.
> >
>
> Speaking from someone who's worked with storage systems that do an explicit
> readahead, this can turn out to be a big giant disaster if it's combined
> with random reads.
>
> Big disaster as far as application-level throughput goes - but does make
> for impressive ganglia graphs!
>
> Brian
>
> > -Todd
> >
> > On Tue, Nov 24, 2009 at 10:02 AM, Eli Collins <el...@cloudera.com> wrote:
> >
> >> Hey Martin,
> >>
> >> It would be an interesting experiment but I'm not sure it would
> >> improve things as the host (and hardware to some extent) are already
> >> reading ahead. A useful exercise would be to evaluate whether the new
> >> default host parameters for on-demand readahead are suitable for
> >> hadoop.
> >>
> >> http://lwn.net/Articles/235164
> >> http://lwn.net/Articles/235181
> >>
> >> Thanks,
> >> Eli
> >>
> >> On Mon, Nov 23, 2009 at 11:23 PM, Martin Mituzas <
> xietao1981@hotmail.com>
> >> wrote:
> >>>
> >>> I read the code and found that the call
> >>> DFSInputStream.read(buf, off, len)
> >>> will cause the DataNode to read len bytes (or less if it encounters the
> >>> end of a block). Why does HDFS not read ahead to improve performance for
> >>> sequential reads?
> >>> --
> >>> View this message in context:
> >>
> http://old.nabble.com/why-does-not-hdfs-read-ahead---tp26491449p26491449.html
> >>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >>>
> >>>
> >>
>
>

Re: why does not hdfs read ahead ?

Posted by Brian Bockelman <bb...@cse.unl.edu>.
On Nov 24, 2009, at 12:06 PM, Todd Lipcon wrote:

> Also, keep in mind that, when you open a block for reading, the DN
> immediately starts writing the entire block (assuming it's requested via the
> xceiver protocol) - it's TCP backpressure on the send window that does flow
> control there.

Ok, that's a pretty freakin' cool idea.  Is it well-documented how this technique works?  How does this affect folks (me) who use the pread interface?

> So, although it's not explicitly reading ahead, most of the
> reads on DFSInputStream should be coming from the TCP receive buffer, not
> making round trips.
> 
> At one point a few weeks ago I did hack explicit readahead around
> DFSInputStream and didn't see an appreciable difference. I didn't spend much
> time on it, though, so I may have screwed something up - wasn't a scientific
> test.
> 

Speaking from someone who's worked with storage systems that do an explicit readahead, this can turn out to be a big giant disaster if it's combined with random reads.

Big disaster as far as application-level throughput goes - but does make for impressive ganglia graphs!

Brian

> -Todd
> 
> On Tue, Nov 24, 2009 at 10:02 AM, Eli Collins <el...@cloudera.com> wrote:
> 
>> Hey Martin,
>> 
>> It would be an interesting experiment but I'm not sure it would
>> improve things as the host (and hardware to some extent) are already
>> reading ahead. A useful exercise would be to evaluate whether the new
>> default host parameters for on-demand readahead are suitable for
>> hadoop.
>> 
>> http://lwn.net/Articles/235164
>> http://lwn.net/Articles/235181
>> 
>> Thanks,
>> Eli
>> 
>> On Mon, Nov 23, 2009 at 11:23 PM, Martin Mituzas <xi...@hotmail.com>
>> wrote:
>>> 
>>> I read the code and found that the call
>>> DFSInputStream.read(buf, off, len)
>>> will cause the DataNode to read len bytes (or less if it encounters the
>>> end of a block). Why does HDFS not read ahead to improve performance for
>>> sequential reads?
>>> --
>>> View this message in context:
>> http://old.nabble.com/why-does-not-hdfs-read-ahead---tp26491449p26491449.html
>>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>> 
>>> 
>> 


Re: why does not hdfs read ahead ?

Posted by Todd Lipcon <to...@cloudera.com>.
Also, keep in mind that, when you open a block for reading, the DN
immediately starts writing the entire block (assuming it's requested via the
xceiver protocol) - it's TCP backpressure on the send window that does flow
control there. So, although it's not explicitly reading ahead, most of the
reads on DFSInputStream should be coming from the TCP receive buffer, not
making round trips.

At one point a few weeks ago I did hack explicit readahead around
DFSInputStream and didn't see an appreciable difference. I didn't spend much
time on it, though, so I may have screwed something up - wasn't a scientific
test.

-Todd

On Tue, Nov 24, 2009 at 10:02 AM, Eli Collins <el...@cloudera.com> wrote:

> Hey Martin,
>
> It would be an interesting experiment but I'm not sure it would
> improve things as the host (and hardware to some extent) are already
> reading ahead. A useful exercise would be to evaluate whether the new
> default host parameters for on-demand readahead are suitable for
> hadoop.
>
> http://lwn.net/Articles/235164
> http://lwn.net/Articles/235181
>
> Thanks,
> Eli
>
> On Mon, Nov 23, 2009 at 11:23 PM, Martin Mituzas <xi...@hotmail.com>
> wrote:
> >
> > I read the code and found that the call
> > DFSInputStream.read(buf, off, len)
> > will cause the DataNode to read len bytes (or less if it encounters the
> > end of a block). Why does HDFS not read ahead to improve performance for
> > sequential reads?
> > --
> > View this message in context:
> http://old.nabble.com/why-does-not-hdfs-read-ahead---tp26491449p26491449.html
> > Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >
> >
>

Re: why does not hdfs read ahead ?

Posted by Eli Collins <el...@cloudera.com>.
Hey Martin,

It would be an interesting experiment but I'm not sure it would
improve things as the host (and hardware to some extent) are already
reading ahead. A useful exercise would be to evaluate whether the new
default host parameters for on-demand readahead are suitable for
hadoop.

http://lwn.net/Articles/235164
http://lwn.net/Articles/235181

Thanks,
Eli

On Mon, Nov 23, 2009 at 11:23 PM, Martin Mituzas <xi...@hotmail.com> wrote:
>
> I read the code and found that the call
> DFSInputStream.read(buf, off, len)
> will cause the DataNode to read len bytes (or less if it encounters the
> end of a block). Why does HDFS not read ahead to improve performance for
> sequential reads?
> --
> View this message in context: http://old.nabble.com/why-does-not-hdfs-read-ahead---tp26491449p26491449.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>