Posted to hdfs-dev@hadoop.apache.org by Eitan Rosenfeld <ei...@gmail.com> on 2014/11/04 15:19:35 UTC

Why do reads take as long as replicated writes?

I am benchmarking my cluster of 16 nodes (all in one rack) with TestDFSIO on
Hadoop 1.0.4.  For simplicity, I turned off speculative task execution and set
the max map and reduce tasks to 1.
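
(For concreteness, the runs look roughly like the following; the exact jar
name and the fact that -fileSize is given in MB are from memory, so treat
this as a sketch rather than the literal command lines:)

  # write phase, then read phase: 64 files of 5 GB (5120 MB) each
  hadoop jar $HADOOP_HOME/hadoop-test-1.0.4.jar TestDFSIO -write -nrFiles 64 -fileSize 5120
  hadoop jar $HADOOP_HOME/hadoop-test-1.0.4.jar TestDFSIO -read -nrFiles 64 -fileSize 5120

  # mapred-site.xml settings that disable speculation and cap the task slots at 1
  #   mapred.map.tasks.speculative.execution    = false
  #   mapred.reduce.tasks.speculative.execution = false
  #   mapred.tasktracker.map.tasks.maximum      = 1
  #   mapred.tasktracker.reduce.tasks.maximum   = 1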

With a replication factor of 2, writing 1 file of 5GB takes twice as long as
reading 1 file. This result seems to make sense since the replication results
in twice the I/O in the cluster versus the read. However, as I scale up the
number of 5GB files from 1 to 64 files, reading ultimately takes as long as
writing. In particular, I see this result when writing and reading 64
such files.

What could cause read performance to degrade faster than write performance
as the number of files increases?

The full results (number of 5GB files, ratio of write time to read
time) are below:
1,  2.02
2,  1.87
4,  1.73
8,  1.54
16,  1.37
32,  1.29
64,  1.01

Thank you,

Eitan

Re: Why do reads take as long as replicated writes?

Posted by Colin McCabe <cm...@alumni.cmu.edu>.
I strongly suggest benchmarking a modern version of Hadoop rather than
Hadoop 1.x.  The native CRC stuff from HDFS-3528 greatly reduces CPU
consumption on the read path.  I wrote about some other read path
optimizations in Hadoop 2.x here:
http://www.club.cc.cmu.edu/~cmccabe/d/2014.04_ApacheCon_HDFS_read_path_optimization_presentation.pdf
I agree with Andrew that TeraGen and TeraValidate are probably a
better choice for you.  Look for the bottleneck in your system.
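
(Something along these lines, assuming the stock examples jar that ships
with your distribution; adjust the jar name, paths, and row count to taste.
teravalidate is normally run over terasort output, so on raw teragen data
it will complain about ordering, but it still reads every record, which is
what you care about here:)

  # teragen writes N 100-byte rows; 3,200,000,000 rows is roughly 320 GB (64 x 5 GB)
  hadoop jar $HADOOP_HOME/hadoop-examples-*.jar teragen 3200000000 /benchmarks/tera-in
  # read the data back after dropping caches on the datanodes
  hadoop jar $HADOOP_HOME/hadoop-examples-*.jar teravalidate /benchmarks/tera-in /benchmarks/tera-report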

best,
Colin

On Wed, Nov 5, 2014 at 4:10 PM, Eitan Rosenfeld <ei...@gmail.com> wrote:
> Daemeon - Indeed, I neglected to mention that I am clearing the caches
> throughout my cluster before running the read benchmark. My expectation
> was to ideally get results that were proportionate to disk I/O, given
> that replicated writes perform twice the disk I/O relative to reads. I've
> verified the I/O with iostat. However, as I mentioned earlier, reads and
> writes converge as the number of files in the workload increases, despite
> the constant ratio of write I/O to read I/O.
>
> Andrew - I've verified that the network is not the bottleneck. (All of the
> links are 10Gb). As you'll see, I suspect that the lack of data-locality
> causes the slowdown because a given node can be responsible for
> serving multiple remote block reads all at once.
>
> I hope my understanding of writes and reads can be confirmed:
>
> Write pipelining allows a node to write, replicate, and receive replicated
> data in parallel. If node A is writing its own data while receiving
> replicated data from node B, node B does not wait for node A to finish
> writing B's replicated data to disk. Rather, node B can begin writing its
> next local block immediately.  Thus, pipelining helps replicated writes
> have good performance.
>
> In contrast, let's assume node A is currently reading a block. If node A
> receives an additional read request from node B, A will take longer to
> serve the block to B because of A's pre-existing read. Because node B
> waits longer for the block to be served from A, there is a delay on node B
> before it attempts to read the next block in the file. Multiple read
> requests from different nodes are a consequence of having no built-in
> data locality with TestDFSIO. Finally, as the number of concurrent tasks
> throughout the cluster increases, the wait time for reads increases.
>
> Is my understanding of these read and write mechanisms correct?
>
> Thank you,
> Eitan

Re: Why do reads take as long as replicated writes?

Posted by Eitan Rosenfeld <ei...@gmail.com>.
Daemeon - Indeed, I neglected to mention that I am clearing the caches
throughout my cluster before running the read benchmark. My expectation
was to ideally get results that were proportionate to disk I/O, given
that replicated writes perform twice the disk I/O relative to reads. I've
verified the I/O with iostat. However, as I mentioned earlier, reads and
writes converge as the number of files in the workload increases, despite
the constant ratio of write I/O to read I/O.
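
(Concretely, between the write and the read phase I run something like the
following on every datanode; the host range is just an illustration of my
setup, and any ssh loop would do in place of pdsh:)

  # as root: flush dirty pages, then drop the OS page cache so reads hit the disks
  pdsh -w node[01-16] "sync && echo 3 > /proc/sys/vm/drop_caches"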

Andrew - I've verified that the network is not the bottleneck. (All of the
links are 10Gb). As you'll see, I suspect that the lack of data-locality
causes the slowdown because a given node can be responsible for
serving multiple remote block reads all at once.
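
(One way to see this is to compare where the map tasks ran against where
the block replicas actually live; fsck prints the replica locations. The
path below is the default TestDFSIO output directory and may differ if
test.build.data was overridden:)

  # list each benchmark file, its blocks, and the datanodes holding the replicas
  hadoop fsck /benchmarks/TestDFSIO -files -blocks -locations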

I hope my understanding of writes and reads can be confirmed:

Write pipelining allows a node to write, replicate, and receive replicated
data in parallel. If node A is writing its own data while receiving
replicated data from node B, node B does not wait for node A to finish
writing B's replicated data to disk. Rather, node B can begin writing its
next local block immediately.  Thus, pipelining helps replicated writes
have good performance.

In contrast, let's assume node A is currently reading a block. If node A
receives an additional read request from node B, A will take longer to
serve the block to B because of A's pre-existing read. Because node B
waits longer for the block to be served from A, there is a delay on node B
before it attempts to read the next block in the file. Multiple read
requests from different nodes are a consequence of having no built-in
data locality with TestDFSIO. Finally, as the number of concurrent tasks
throughout the cluster increases, the wait time for reads increases.

Is my understanding of these read and write mechanisms correct?

Thank you,
Eitan

Re: Why do reads take as long as replicated writes?

Posted by daemeon reiydelle <da...@gmail.com>.
Reads can be faster than writes for smaller bursts of I/O, in part due to
disk and memory caching of reads (if you turn on write-back caching, which
is not recommended, your numbers above are likely to get closer together).
As your volume of I/O increases, you tend to reach a point where you are
bound (more or less) by physical I/O and are no longer leveraging the cache
optimization. FYI, if what you are seeing is these cache misses, then
reducing the percentage of memory that UNIX uses for file system buffers
should make the phenomenon you observed occur sooner.

If you look at iostat output, you may see that the read and write service
times on the devices are converging, which is another indication of cache
misses.
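
(Something like the following while the read benchmark is running; the
exact column names vary a bit between sysstat versions:)

  # per-device stats every 5 seconds: watch await/svctm (or r_await/w_await) and %util
  iostat -dx 5
  # and watch how much RAM the page cache is holding as the run progresses
  free -m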

“The race is not to the swift, nor the battle to the strong, but to
those who can see it coming and jump aside.” - Hunter Thompson

Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872

On Tue, Nov 4, 2014 at 3:42 PM, Andrew Wang <an...@cloudera.com>
wrote:

> I would advise against using TestDFSIO, instead trying TeraGen and
> TeraValidate. IIRC TestDFSIO doesn't actually schedule for task locality,
> so it's not very good if you have a cluster bigger than your replication
> factor. You might be network bound as you try to read more files.
>
> Best,
> Andrew
>
> On Tue, Nov 4, 2014 at 6:19 AM, Eitan Rosenfeld <ei...@gmail.com> wrote:
>
> > I am benchmarking my cluster of 16 nodes (all in one rack) with TestDFSIO
> > on
> > Hadoop 1.0.4.  For simplicity, I turned off speculative task execution
> and
> > set
> > the max map and reduce tasks to 1.
> >
> > With a replication factor of 2, writing 1 file of 5GB takes twice as long
> > as
> > reading 1 file. This result seems to make sense since the replication
> > results
> > in twice the I/O in the cluster versus the read. However, as I scale up
> the
> > number of 5GB files from 1 to 64 files, reading ultimately takes as long
> as
> > writing. In particular, I see this result when writing and reading 64
> > such files.
> >
> > What could cause read performance to degrade faster than write
> performance
> > as the number of files increases?
> >
> > The full results (number of 5GB files, ratio of write time to read
> > time) are below:
> > 1,  2.02
> > 2,  1.87
> > 4,  1.73
> > 8,  1.54
> > 16,  1.37
> > 32,  1.29
> > 64,  1.01
> >
> > Thank you,
> >
> > Eitan
> >
>

Re: Why do reads take as long as replicated writes?

Posted by Andrew Wang <an...@cloudera.com>.
I would advise against using TestDFSIO, instead trying TeraGen and
TeraValidate. IIRC TestDFSIO doesn't actually schedule for task locality,
so it's not very good if you have a cluster bigger than your replication
factor. You might be network bound as you try to read more files.
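
(A quick way to check is to watch the NIC counters on a few datanodes while
the read job runs, for example:)

  # per-interface throughput every 5 seconds; compare rxkB/s + txkB/s to your link speed
  sar -n DEV 5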

Best,
Andrew

On Tue, Nov 4, 2014 at 6:19 AM, Eitan Rosenfeld <ei...@gmail.com> wrote:

> I am benchmarking my cluster of 16 nodes (all in one rack) with TestDFSIO
> on
> Hadoop 1.0.4.  For simplicity, I turned off speculative task execution and
> set
> the max map and reduce tasks to 1.
>
> With a replication factor of 2, writing 1 file of 5GB takes twice as long
> as
> reading 1 file. This result seems to make sense since the replication
> results
> in twice the I/O in the cluster versus the read. However, as I scale up the
> number of 5GB files from 1 to 64 files, reading ultimately takes as long as
> writing. In particular, I see this result when writing and reading 64
> such files.
>
> What could cause read performance to degrade faster than write performance
> as the number of files increases?
>
> The full results (number of 5GB files, ratio of write time to read
> time) are below:
> 1,  2.02
> 2,  1.87
> 4,  1.73
> 8,  1.54
> 16,  1.37
> 32,  1.29
> 64,  1.01
>
> Thank you,
>
> Eitan
>