Posted to user@hadoop.apache.org by Pankaj Gupta <pa...@brightroll.com> on 2012/11/16 19:55:25 UTC

HDFS block size

Hi,

I apologize for asking a question that has probably been discussed many
times before, but I just want to be sure I understand it correctly. My
question is about the advantages of a large block size in HDFS.

The Hadoop Definitive Guide compares HDFS with regular file systems and
indicates that the advantage is a lower number of seeks (as far as I
understood it; maybe I read it incorrectly, and if so I apologize). But, as
I understand it, the datanode stores its data on a regular file system. If
that is so, how does a bigger HDFS block size provide better seek
performance, when the data will ultimately be read from a regular file
system that has a much smaller block size?

I see other advantages of a bigger block size, though:
- Fewer entries on the NameNode to keep track of.
- Less switching from datanode to datanode for the HDFS client when
  fetching a file. If the block size were small, this switching alone
  would reduce performance a lot. Perhaps this is the seek that the
  Definitive Guide refers to.
- Lower overhead for setting up map tasks. MR usually creates one map
  task per block, so a smaller block would mean less computation per map
  task, and the overhead of setting up the map task would become
  significant.


I want to make sure I understand the advantages of having a larger block
size. I specifically want to know whether there is any advantage in terms
of disk seeks; that one point has me very confused.
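
For concreteness, here is a rough sketch of how I picture a client choosing
the block size per file, using the FileSystem.create overload that takes an
explicit block size (the 128 MB figure, the replication factor, and the path
are just example values of mine, not anything from the book):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeSketch {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        long blockSize = 128L * 1024 * 1024;   // example: ask for 128 MB HDFS blocks
        int bufferSize = 4096;                 // client-side write buffer
        short replication = 3;                 // example replication factor

        // This create() overload takes the block size explicitly; without it,
        // the cluster-wide default block size from the configuration is used.
        FSDataOutputStream out = fs.create(
            new Path("/tmp/blocksize-example.dat"), true, bufferSize,
            replication, blockSize);
        out.writeBytes("hello hdfs\n");
        out.close();
      }
    }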

Thanks in Advance,
Pankaj

Re: HDFS block size

Posted by Pankaj Gupta <pa...@brightroll.com>.
Thanks for the explanation and showing a different perspective.

On Fri, Nov 16, 2012 at 12:09 PM, Ted Dunning <td...@maprtech.com> wrote:

> Andy's points are reasonable but there are a few omissions,
>
> - modern file systems are pretty good at writing large files into
> contiguous blocks if they have a reasonable amount of space available.
>
> - the seeks in question are likely to be more to do with checking
> directories for block locations than seeking to small-ish file starts
> because modern file systems tend to group together files that are written
> at about the same time.
>
> - it is quite possible to build an HDFS-like file system that uses very
> small blocks.  There really are three considerations here that, when
> conflated, make the design more difficult than necessary.  These three
> concepts are:
>
>     the primitive unit of disk allocation
>
> This is the size of disk allocation.  For HDFS, this is variable in size
> since blocks can be smaller than the max size.  The key problem with a
> large size here is that it is relatively difficult to allow quick reading
> of the file during writing.  With a smaller block size, the block can be
> committed in a way that the reader can read it much sooner.  Extremely
> large block sizes also make R/W file systems and snapshots more difficult
> for basically the same reason.  There is no strong reason that this has to
> be conflated with the striping chunk size.
>
> Putting HDFS on top of ext3 or ext4 kind of does this, but because HDFS
> knows nothing about the blocks in the underlying system, you don't get the
> benefit.
>
>     the unit of node striping
>
> This is the size of data that is sent to each node and is intended to
> achieve read parallelism in map-reduce programs.  This should be large
> enough to cause a map task to take a reasonable time to process in order to
> make task scheduling easier.  A few hundred megabytes is commonly a good
> size, but different applications may prefer sizes as small as a MB or as
> large as a few GB.
>
>     the unit of scaling
>
> It is typical that something somewhere needs to remember what gets stuck
> where in the cluster.  Currently the name node does this with blocks.
>  Blocks are a bad choice here because they come and go quite often which
> means that the namenode has to handle lots of changes and because this
> makes caching of the name node data or persisting it to disk much harder.
>  Blocks also tend to limit scaling because you have to have so many of them
> in a large system.
>
> A counter-example to the design of HDFS is the MapR architecture.  There,
> the disk blocks are 8K, chunks are a few hundred megabytes (but flexible
> within a single cluster) and the scaling unit is 10's of gigabytes.
>  Separating these concepts allows disk contiguity, efficient node striping
> and simple HA of the file system.
>
>
> On Fri, Nov 16, 2012 at 11:53 AM, Andy Isaacson <ad...@cloudera.com> wrote:
>
>> On Fri, Nov 16, 2012 at 10:55 AM, Pankaj Gupta <pa...@brightroll.com>
>> wrote:
>> > The Hadoop Definitive Guide provides comparison with regular file
>> systems
>> > and indicates the advantage being lower number of seeks(as far as I
>> > understood it, may be I read it incorreclty, if so I apologize). But,
>> as I
>> > understand, the data node stores data on a regular file system. If this
>> is
>> > so then how does having a bigger HDFS block size provide better seek
>> > performance, when the data will ultimately be read from regular file
>> system
>> > which has much smaller block size.
>>
>> Suppose that HDFS stored data in smaller blocks (64kb for example).
>> Then ext4 would have no reason to put those small files close together
>> on disk, and reading from a HDFS file would mean reading from very
>> many ext4 files, and probably would mean many seeks.
>>
>> The large block size design of HDFS avoids that problem by giving ext4
>> the information it needs to optimize for our desired use case.
>>
>> > I see other advantages of bigger block size though:
>> >
>> > Less entries on NameNode to keep track of
>>
>> That's another benefit.
>>
>> > Less switching from datanode to datanode for the HDFS client when
>> fetching
>> > the file. If block size were small, just this switching would reduce the
>> > performance a lot. Perhaps this is the seek that the definitive guide
>> refers
>> > to.
>>
>> If one were building HDFS with a smaller block size, you'd probably
>> have to overlap block fetches from many data nodes in order to get
>> decent performance. So yes, this "switching" as you term it would be a
>> performance bottleneck.
>>
>> > Less overhead cost of setting up Map tasks. The way MR usually works is
>> that
>> > one Map task is created per block. Smaller block will mean less
>> computation
>> > per map task and thus overhead of setting up the map task would become
>> > significant.
>>
>> A MR designed for a small-block-HDFS would probably have to do
>> something different rather than one mapper per block.
>>
>> > I want to make sure I understand the advantages of having a larger block
>> > size. I specifically want to know whether there is any advantage in
>> terms of
>> > disk seeks; that one thing has got me very confused.
>>
>> Seems like you have a pretty good understanding of the issues, and I
>> hope I clarified the seek issue above.
>>
>> -andy
>>
>
>


-- 


*P* | (415) 677-9222 ext. 205 | *F* | (415) 677-0895 | pankaj@brightroll.com

Pankaj Gupta | Software Engineer

*BrightRoll, Inc. *| Smart Video Advertising | www.brightroll.com


United States | Canada | United Kingdom | Germany


We're hiring <http://newton.newtonsoftware.com/career/CareerHome.action?clientId=8a42a12b3580e2060135837631485aa7>!

Re: HDFS block size

Posted by Ted Dunning <td...@maprtech.com>.
Andy's points are reasonable, but there are a few omissions:

- modern file systems are pretty good at writing large files into
contiguous blocks if they have a reasonable amount of space available.

- the seeks in question are likely to be more to do with checking
directories for block locations than seeking to small-ish file starts
because modern file systems tend to group together files that are written
at about the same time.

- it is quite possible to build an HDFS-like file system that uses very
small blocks.  There really are three considerations here that, when
conflated, make the design more difficult than necessary.  These three
concepts are:

    the primitive unit of disk allocation

This is the granularity at which disk space is allocated.  For HDFS it is
variable, since blocks can be smaller than the maximum size.  The key
problem with a large size here is that it is relatively difficult to allow
quick reading of the file while it is still being written.  With a smaller
block size, the block can be committed in a way that lets the reader see it
much sooner.  Extremely large block sizes also make read/write file systems
and snapshots more difficult, for basically the same reason.  There is no
strong reason that this has to be conflated with the striping chunk size.

Putting HDFS on top of ext3 or ext4 kind of does this, but because HDFS
knows nothing about the blocks in the underlying system, you don't get the
benefit.

    the unit of node striping

This is the size of the data that is sent to each node and is intended to
achieve read parallelism in map-reduce programs.  It should be large enough
that a map task takes a reasonable amount of time to process, which makes
task scheduling easier.  A few hundred megabytes is commonly a good size,
but different applications may prefer sizes as small as a MB or as large as
a few GB.
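
To make that concrete, here is a tiny back-of-envelope sketch of how long a
mapper spends just scanning its chunk at different chunk sizes (the 100 MB/s
scan rate is an assumed figure, not a measurement):

    public class ChunkScanTime {
      public static void main(String[] args) {
        double scanRateMbPerSec = 100.0;                 // assumed sequential read rate
        long[] chunkSizesMb = {1, 64, 256, 1024, 4096};  // candidate striping/chunk sizes

        for (long mb : chunkSizesMb) {
          double seconds = mb / scanRateMbPerSec;
          // With 1 MB chunks the scan takes ~10 ms, so per-task scheduling
          // overhead dominates; with a few hundred MB it does not.
          System.out.printf("chunk %5d MB -> ~%.2f s of pure scanning per map task%n",
              mb, seconds);
        }
      }
    }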

    the unit of scaling

It is typical that something somewhere needs to remember what gets stuck
where in the cluster.  Currently the namenode does this with blocks.
Blocks are a bad choice here because they come and go quite often, which
means the namenode has to handle lots of changes, and because this makes
caching the namenode data or persisting it to disk much harder.  Blocks
also tend to limit scaling because you have to have so many of them in a
large system.
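
As a rough illustration of the numbers involved (the 1 PiB of data and the
~150 bytes of namenode memory per tracked record are assumptions of mine,
not measurements):

    public class ScalingUnitSketch {
      public static void main(String[] args) {
        long totalData = 1L << 50;                              // assume ~1 PiB of logical data
        long[] unitSizes = {64L << 20, 128L << 20, 10L << 30};  // 64 MB, 128 MB, 10 GB units
        long bytesPerRecord = 150;                              // rough per-record namenode cost

        for (long unit : unitSizes) {
          long records = totalData / unit;
          double gb = records * (double) bytesPerRecord / (1L << 30);
          System.out.printf("unit %7d MB -> %,12d records to track, ~%.2f GB of metadata%n",
              unit >> 20, records, gb);
        }
      }
    }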

A counter-example to the design of HDFS is the MapR architecture.  There,
the disk blocks are 8K, chunks are a few hundred megabytes (but flexible
within a single cluster), and the scaling unit is tens of gigabytes.
Separating these concepts allows disk contiguity, efficient node striping,
and simple HA of the file system.

On Fri, Nov 16, 2012 at 11:53 AM, Andy Isaacson <ad...@cloudera.com> wrote:

> On Fri, Nov 16, 2012 at 10:55 AM, Pankaj Gupta <pa...@brightroll.com>
> wrote:
> > The Hadoop Definitive Guide provides comparison with regular file systems
> > and indicates the advantage being lower number of seeks(as far as I
> > understood it, may be I read it incorreclty, if so I apologize). But, as
> I
> > understand, the data node stores data on a regular file system. If this
> is
> > so then how does having a bigger HDFS block size provide better seek
> > performance, when the data will ultimately be read from regular file
> system
> > which has much smaller block size.
>
> Suppose that HDFS stored data in smaller blocks (64kb for example).
> Then ext4 would have no reason to put those small files close together
> on disk, and reading from a HDFS file would mean reading from very
> many ext4 files, and probably would mean many seeks.
>
> The large block size design of HDFS avoids that problem by giving ext4
> the information it needs to optimize for our desired use case.
>
> > I see other advantages of bigger block size though:
> >
> > Less entries on NameNode to keep track of
>
> That's another benefit.
>
> > Less switching from datanode to datanode for the HDFS client when
> fetching
> > the file. If block size were small, just this switching would reduce the
> > performance a lot. Perhaps this is the seek that the definitive guide
> refers
> > to.
>
> If one were building HDFS with a smaller block size, you'd probably
> have to overlap block fetches from many data nodes in order to get
> decent performance. So yes, this "switching" as you term it would be a
> performance bottleneck.
>
> > Less overhead cost of setting up Map tasks. The way MR usually works is
> that
> > one Map task is created per block. Smaller block will mean less
> computation
> > per map task and thus overhead of setting up the map task would become
> > significant.
>
> A MR designed for a small-block-HDFS would probably have to do
> something different rather than one mapper per block.
>
> > I want to make sure I understand the advantages of having a larger block
> > size. I specifically want to know whether there is any advantage in
> terms of
> > disk seeks; that one thing has got me very confused.
>
> Seems like you have a pretty good understanding of the issues, and I
> hope I clarified the seek issue above.
>
> -andy
>

Re: HDFS block size

Posted by Pankaj Gupta <pa...@brightroll.com>.
Thanks for the explanation. Sounds like seek performance is better because
reading one large file on the file system is faster than reading many small
files; that makes sense.

On Fri, Nov 16, 2012 at 11:53 AM, Andy Isaacson <ad...@cloudera.com> wrote:

> On Fri, Nov 16, 2012 at 10:55 AM, Pankaj Gupta <pa...@brightroll.com>
> wrote:
> > The Hadoop Definitive Guide provides comparison with regular file systems
> > and indicates the advantage being lower number of seeks(as far as I
> > understood it, may be I read it incorreclty, if so I apologize). But, as
> I
> > understand, the data node stores data on a regular file system. If this
> is
> > so then how does having a bigger HDFS block size provide better seek
> > performance, when the data will ultimately be read from regular file
> system
> > which has much smaller block size.
>
> Suppose that HDFS stored data in smaller blocks (64kb for example).
> Then ext4 would have no reason to put those small files close together
> on disk, and reading from a HDFS file would mean reading from very
> many ext4 files, and probably would mean many seeks.
>
> The large block size design of HDFS avoids that problem by giving ext4
> the information it needs to optimize for our desired use case.
>
> > I see other advantages of bigger block size though:
> >
> > Less entries on NameNode to keep track of
>
> That's another benefit.
>
> > Less switching from datanode to datanode for the HDFS client when
> fetching
> > the file. If block size were small, just this switching would reduce the
> > performance a lot. Perhaps this is the seek that the definitive guide
> refers
> > to.
>
> If one were building HDFS with a smaller block size, you'd probably
> have to overlap block fetches from many data nodes in order to get
> decent performance. So yes, this "switching" as you term it would be a
> performance bottleneck.
>
> > Less overhead cost of setting up Map tasks. The way MR usually works is
> that
> > one Map task is created per block. Smaller block will mean less
> computation
> > per map task and thus overhead of setting up the map task would become
> > significant.
>
> A MR designed for a small-block-HDFS would probably have to do
> something different rather than one mapper per block.
>
> > I want to make sure I understand the advantages of having a larger block
> > size. I specifically want to know whether there is any advantage in
> terms of
> > disk seeks; that one thing has got me very confused.
>
> Seems like you have a pretty good understanding of the issues, and I
> hope I clarified the seek issue above.
>
> -andy
>



-- 


*P* | (415) 677-9222 ext. 205 | *F* | (415) 677-0895 | pankaj@brightroll.com

Pankaj Gupta | Software Engineer

*BrightRoll, Inc. *| Smart Video Advertising | www.brightroll.com


United States | Canada | United Kingdom | Germany


We're hiring <http://newton.newtonsoftware.com/career/CareerHome.action?clientId=8a42a12b3580e2060135837631485aa7>!

Re: HDFS block size

Posted by Pankaj Gupta <pa...@brightroll.com>.
Thanks for the explanation. Sounds like the seek performance is faster
because reading one large file on the filesystem is faster than reading
many small files; that makes sense.

On Fri, Nov 16, 2012 at 11:53 AM, Andy Isaacson <ad...@cloudera.com> wrote:

> On Fri, Nov 16, 2012 at 10:55 AM, Pankaj Gupta <pa...@brightroll.com>
> wrote:
> > The Hadoop Definitive Guide provides comparison with regular file systems
> > and indicates the advantage being lower number of seeks(as far as I
> > understood it, may be I read it incorreclty, if so I apologize). But, as
> I
> > understand, the data node stores data on a regular file system. If this
> is
> > so then how does having a bigger HDFS block size provide better seek
> > performance, when the data will ultimately be read from regular file
> system
> > which has much smaller block size.
>
> Suppose that HDFS stored data in smaller blocks (64kb for example).
> Then ext4 would have no reason to put those small files close together
> on disk, and reading from a HDFS file would mean reading from very
> many ext4 files, and probably would mean many seeks.
>
> The large block size design of HDFS avoids that problem by giving ext4
> the information it needs to optimize for our desired use case.
>
> > I see other advantages of bigger block size though:
> >
> > Less entries on NameNode to keep track of
>
> That's another benefit.
>
> > Less switching from datanode to datanode for the HDFS client when
> fetching
> > the file. If block size were small, just this switching would reduce the
> > performance a lot. Perhaps this is the seek that the definitive guide
> refers
> > to.
>
> If one were building HDFS with a smaller block size, you'd probably
> have to overlap block fetches from many data nodes in order to get
> decent performance. So yes, this "switching" as you term it would be a
> performance bottleneck.
>
> > Less overhead cost of setting up Map tasks. The way MR usually works is
> that
> > one Map task is created per block. Smaller block will mean less
> computation
> > per map task and thus overhead of setting up the map task would become
> > significant.
>
> A MR designed for a small-block-HDFS would probably have to do
> something different rather than one mapper per block.
>
> > I want to make sure I understand the advantages of having a larger block
> > size. I specifically want to know whether there is any advantage in
> terms of
> > disk seeks; that one thing has got me very confused.
>
> Seems like you have a pretty good understanding of the issues, and I
> hope I clarified the seek issue above.
>
> -andy
>



-- 


*P* | (415) 677-9222 ext. 205 *F *| (415) 677-0895 | pankaj@brightroll.com

Pankaj Gupta | Software Engineer

*BrightRoll, Inc. *| Smart Video Advertising | www.brightroll.com


United States | Canada | United Kingdom | Germany


We're hiring<http://newton.newtonsoftware.com/career/CareerHome.action?clientId=8a42a12b3580e2060135837631485aa7>
!

Re: HDFS block size

Posted by Pankaj Gupta <pa...@brightroll.com>.
Thanks for the explanation. Sounds like the seek performance is faster
because reading one large file on the filesystem is faster than reading
many small files; that makes sense.

On Fri, Nov 16, 2012 at 11:53 AM, Andy Isaacson <ad...@cloudera.com> wrote:

> On Fri, Nov 16, 2012 at 10:55 AM, Pankaj Gupta <pa...@brightroll.com>
> wrote:
> > The Hadoop Definitive Guide provides comparison with regular file systems
> > and indicates the advantage being lower number of seeks(as far as I
> > understood it, may be I read it incorreclty, if so I apologize). But, as
> I
> > understand, the data node stores data on a regular file system. If this
> is
> > so then how does having a bigger HDFS block size provide better seek
> > performance, when the data will ultimately be read from regular file
> system
> > which has much smaller block size.
>
> Suppose that HDFS stored data in smaller blocks (64kb for example).
> Then ext4 would have no reason to put those small files close together
> on disk, and reading from a HDFS file would mean reading from very
> many ext4 files, and probably would mean many seeks.
>
> The large block size design of HDFS avoids that problem by giving ext4
> the information it needs to optimize for our desired use case.
>
> > I see other advantages of bigger block size though:
> >
> > Less entries on NameNode to keep track of
>
> That's another benefit.
>
> > Less switching from datanode to datanode for the HDFS client when
> fetching
> > the file. If block size were small, just this switching would reduce the
> > performance a lot. Perhaps this is the seek that the definitive guide
> refers
> > to.
>
> If one were building HDFS with a smaller block size, you'd probably
> have to overlap block fetches from many data nodes in order to get
> decent performance. So yes, this "switching" as you term it would be a
> performance bottleneck.
>
> > Less overhead cost of setting up Map tasks. The way MR usually works is
> that
> > one Map task is created per block. Smaller block will mean less
> computation
> > per map task and thus overhead of setting up the map task would become
> > significant.
>
> A MR designed for a small-block-HDFS would probably have to do
> something different rather than one mapper per block.
>
> > I want to make sure I understand the advantages of having a larger block
> > size. I specifically want to know whether there is any advantage in
> terms of
> > disk seeks; that one thing has got me very confused.
>
> Seems like you have a pretty good understanding of the issues, and I
> hope I clarified the seek issue above.
>
> -andy
>



-- 


*P* | (415) 677-9222 ext. 205 *F *| (415) 677-0895 | pankaj@brightroll.com

Pankaj Gupta | Software Engineer

*BrightRoll, Inc. *| Smart Video Advertising | www.brightroll.com


United States | Canada | United Kingdom | Germany


We're hiring<http://newton.newtonsoftware.com/career/CareerHome.action?clientId=8a42a12b3580e2060135837631485aa7>
!

Re: HDFS block size

Posted by Pankaj Gupta <pa...@brightroll.com>.
Thanks for the explanation. Sounds like the seek performance is faster
because reading one large file on the filesystem is faster than reading
many small files; that makes sense.

On Fri, Nov 16, 2012 at 11:53 AM, Andy Isaacson <ad...@cloudera.com> wrote:

> On Fri, Nov 16, 2012 at 10:55 AM, Pankaj Gupta <pa...@brightroll.com>
> wrote:
> > The Hadoop Definitive Guide provides comparison with regular file systems
> > and indicates the advantage being lower number of seeks(as far as I
> > understood it, may be I read it incorreclty, if so I apologize). But, as
> I
> > understand, the data node stores data on a regular file system. If this
> is
> > so then how does having a bigger HDFS block size provide better seek
> > performance, when the data will ultimately be read from regular file
> system
> > which has much smaller block size.
>
> Suppose that HDFS stored data in smaller blocks (64kb for example).
> Then ext4 would have no reason to put those small files close together
> on disk, and reading from a HDFS file would mean reading from very
> many ext4 files, and probably would mean many seeks.
>
> The large block size design of HDFS avoids that problem by giving ext4
> the information it needs to optimize for our desired use case.
>
> > I see other advantages of bigger block size though:
> >
> > Less entries on NameNode to keep track of
>
> That's another benefit.
>
> > Less switching from datanode to datanode for the HDFS client when
> fetching
> > the file. If block size were small, just this switching would reduce the
> > performance a lot. Perhaps this is the seek that the definitive guide
> refers
> > to.
>
> If one were building HDFS with a smaller block size, you'd probably
> have to overlap block fetches from many data nodes in order to get
> decent performance. So yes, this "switching" as you term it would be a
> performance bottleneck.
>
> > Less overhead cost of setting up Map tasks. The way MR usually works is
> that
> > one Map task is created per block. Smaller block will mean less
> computation
> > per map task and thus overhead of setting up the map task would become
> > significant.
>
> A MR designed for a small-block-HDFS would probably have to do
> something different rather than one mapper per block.
>
> > I want to make sure I understand the advantages of having a larger block
> > size. I specifically want to know whether there is any advantage in
> terms of
> > disk seeks; that one thing has got me very confused.
>
> Seems like you have a pretty good understanding of the issues, and I
> hope I clarified the seek issue above.
>
> -andy
>



-- 


*P* | (415) 677-9222 ext. 205 *F *| (415) 677-0895 | pankaj@brightroll.com

Pankaj Gupta | Software Engineer

*BrightRoll, Inc. *| Smart Video Advertising | www.brightroll.com


United States | Canada | United Kingdom | Germany


We're hiring! <http://newton.newtonsoftware.com/career/CareerHome.action?clientId=8a42a12b3580e2060135837631485aa7>

Re: HDFS block size

Posted by Andy Isaacson <ad...@cloudera.com>.
On Fri, Nov 16, 2012 at 10:55 AM, Pankaj Gupta <pa...@brightroll.com> wrote:
> The Hadoop Definitive Guide provides comparison with regular file systems
> and indicates the advantage being lower number of seeks(as far as I
> understood it, maybe I read it incorrectly, if so I apologize). But, as I
> understand, the data node stores data on a regular file system. If this is
> so then how does having a bigger HDFS block size provide better seek
> performance, when the data will ultimately be read from regular file system
> which has much smaller block size.

Suppose that HDFS stored data in smaller blocks (64 KB, for example).
Then ext4 would have no reason to put those small files close together
on disk, and reading an HDFS file would mean reading from very many
ext4 files, which would probably mean many seeks.

The large block size design of HDFS avoids that problem by giving ext4
the information it needs to optimize for our desired use case.
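
For illustration, the block size is just a per-file parameter in HDFS.
Below is a minimal Java sketch, assuming the standard
org.apache.hadoop.fs API (the property is dfs.blocksize in Hadoop 2.x,
dfs.block.size in 1.x; the path and sizes are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side default block size for new files: 128 MB
        // (Hadoop 2.x property name).
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);

        // The block size can also be chosen per file at create time:
        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(new Path("/tmp/big-blocks.dat"),
            true, 4096, (short) 3, 128L * 1024 * 1024);
        out.writeBytes("hello hdfs\n");
        out.close();
      }
    }

Whatever value is picked, the datanode still stores each HDFS block as
an ordinary file on the local filesystem, which is why the contiguity
argument above matters.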

> I see other advantages of bigger block size though:
>
> Less entries on NameNode to keep track of

That's another benefit.

> Less switching from datanode to datanode for the HDFS client when fetching
> the file. If block size were small, just this switching would reduce the
> performance a lot. Perhaps this is the seek that the definitive guide refers
> to.

If you were building HDFS with a smaller block size, you'd probably
have to overlap block fetches from many datanodes in order to get
decent performance. So yes, this "switching", as you term it, would be
a performance bottleneck.
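
To see the switching concretely: the client gets one block-location
entry per block from the namenode and then reads each block from one of
the listed datanodes. A small sketch, assuming the
FileSystem#getFileBlockLocations API (the path is made up); with a tiny
block size this list gets very long and the reader has to contact many
more datanodes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PrintBlockLocations {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus stat = fs.getFileStatus(new Path("/data/example.log"));

        // One BlockLocation per HDFS block; each block is read from one
        // of the hosts listed for it.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(stat, 0, stat.getLen());
        for (BlockLocation b : blocks) {
          System.out.printf("offset=%d len=%d hosts=%s%n",
              b.getOffset(), b.getLength(),
              String.join(",", b.getHosts()));
        }
      }
    }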

> Less overhead cost of setting up Map tasks. The way MR usually works is that
> one Map task is created per block. Smaller block will mean less computation
> per map task and thus overhead of setting up the map task would become
> significant.

An MR framework designed for a small-block HDFS would probably have to
do something other than one mapper per block.
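
With the MapReduce we have today, the usual knobs are the split-size
settings and the combining input formats. A hedged sketch of the job
setup, assuming the org.apache.hadoop.mapreduce.lib.input classes (job
name and paths are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class SplitSizing {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-sizing-demo");
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));

        // FileInputFormat normally makes roughly one split (one mapper)
        // per HDFS block. Raising the minimum split size makes each
        // split span several blocks so task-setup overhead doesn't
        // dominate; CombineFileInputFormat goes further and packs many
        // small files into a single split.
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);

        // ... set mapper/reducer classes and the output path, then submit.
      }
    }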

> I want to make sure I understand the advantages of having a larger block
> size. I specifically want to know whether there is any advantage in terms of
> disk seeks; that one thing has got me very confused.

Seems like you have a pretty good understanding of the issues, and I
hope I clarified the seek issue above.

-andy
