Posted to user@hadoop.apache.org by Rahul Bhattacharjee <ra...@gmail.com> on 2013/03/31 18:55:36 UTC

Why big block size for HDFS.

Hi,

In many places it has been written that HDFS stores big blocks to avoid a
huge number of disk seeks: once we seek to the start of a block, only the
data transfer rate is predominant and there are no more seeks. I am not
sure I have understood this correctly.

My question is: no matter what block size we decide on, the data finally
gets written to the computer's HDD, which is formatted with a filesystem
block size measured in KBs. Also, while writing to the local FS (not HDFS),
it is not guaranteed that the blocks we write are contiguous, so there
would be disk seeks anyway. The assumption behind HDFS would only hold if
the underlying FS guaranteed to write the data in contiguous blocks.

Can someone explain a bit?
Thanks,
Rahul
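
For reference, the HDFS block size in the question is a client-side,
per-file setting rather than a property of the disk. A minimal sketch of
choosing it through the Java API, assuming a reachable cluster; the path
and sizes below are made up for illustration, and the property name is the
Hadoop 2.x one (dfs.blocksize, formerly dfs.block.size):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Client-side default, normally set in hdfs-site.xml (128 MB here).
            conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

            FileSystem fs = FileSystem.get(conf);

            // The block size can also be chosen per file at create time;
            // the path here is hypothetical.
            long blockSize = 256L * 1024 * 1024;
            FSDataOutputStream out = fs.create(new Path("/tmp/example.dat"),
                    true,                                      // overwrite
                    conf.getInt("io.file.buffer.size", 4096),  // write buffer
                    (short) 3,                                 // replication
                    blockSize);
            out.writeBytes("hello");
            out.close();
            fs.close();
        }
    }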

Re: Why big block size for HDFS.

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Thanks a lot John, Azuryy.

I had guessed it was about HDD optimization. Then it might be good to
defragment the underlying disks during general maintenance downtime.

Thanks,
Rahul




On Mon, Apr 1, 2013 at 12:28 AM, John Lilley <jo...@redpoint.net> wrote:

> *From:* Rahul Bhattacharjee [mailto:rahul.rec.dgp@gmail.com]
> *Subject:* Why big block size for HDFS.
>
> >In many places it has been written that HDFS stores big blocks to avoid
> >a huge number of disk seeks: once we seek to the start of a block, only
> >the data transfer rate is predominant and there are no more seeks. I am
> >not sure I have understood this correctly.
>
> >My question is: no matter what block size we decide on, the data finally
> >gets written to the computer's HDD, which is formatted with a filesystem
> >block size measured in KBs. Also, while writing to the local FS (not
> >HDFS), it is not guaranteed that the blocks we write are contiguous, so
> >there would be disk seeks anyway. The assumption behind HDFS would only
> >hold if the underlying FS guaranteed to write the data in contiguous
> >blocks.
>
> >Can someone explain a bit?
> >Thanks,
> >Rahul
>
> While there are no guarantees that disk storage will be contiguous, the OS
> will attempt to keep large files contiguous (and may even defragment them
> over time), and if all files are written using large blocks this is more
> likely to be the case. If storage is contiguous, you can write a complete
> track without seeking. Track sizes vary, but a 1 TB disk might have about
> 500 KB per track. Stepping to an adjacent track is also much cheaper than
> an average seek and, as you might expect, has been optimized in hardware
> to assist sequential I/O. However, when you switch storage units you will
> probably incur at least one full seek at the start of the block (since the
> head was probably somewhere else at the time). The result is that, on
> average, writing sequential files is very fast (>100 MB/s on typical
> SATA). But I think the block-size overhead has more to do with finding
> where to read the next block from, assuming the data has been distributed
> evenly.
>
> So consider connection overhead when the data is distributed. I am no
> expert on the Hadoop internals, but I suspect that somewhere a TCP
> connection is opened to transfer data. Whether connection overhead is
> reduced by maintaining open connection pools, I don't know. But let's
> assume there is *some* overhead in switching the data transfer from
> machine "A", which owns block "1000", to machine "B", which owns block
> "1001". The larger the block size, the less significant this overhead is
> relative to the sequential transfer rate.
>
> In addition, MapReduce/YARN has an easier time of scheduling if there are
> fewer blocks.
>
> --john
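
To put rough numbers on the seek argument in the reply quoted above: the
larger the block, the smaller the share of read time spent seeking. A
back-of-envelope sketch, using assumed (not measured) figures of about
10 ms per random seek and 100 MB/s of sequential transfer:

    // Assumed figures for illustration only, not measurements.
    public class SeekOverhead {
        public static void main(String[] args) {
            double seekMs = 10.0;              // assumed average seek time
            double transferMBps = 100.0;       // assumed sequential rate
            long[] blockSizesMB = {4, 64, 128};
            for (long mb : blockSizesMB) {
                double transferMs = mb / transferMBps * 1000.0;
                double seekShare = seekMs / (seekMs + transferMs) * 100.0;
                System.out.printf("block %4d MB: transfer %7.1f ms, seek share %5.1f%%%n",
                        mb, transferMs, seekShare);
            }
        }
    }

Under these assumptions, one seek per 64 MB block costs roughly 1.5% of the
read time, while one seek per 4 MB block costs about 20%.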

RE: Why big block size for HDFS.

Posted by John Lilley <jo...@redpoint.net>.
From: Rahul Bhattacharjee [mailto:rahul.rec.dgp@gmail.com]
Subject: Why big block size for HDFS.

>In many places it has been written that HDFS stores big blocks to avoid a huge number of disk seeks: once we seek to the start of a block, only the data transfer rate is predominant and there are no more seeks. I am not sure I have understood this correctly.
>My question is: no matter what block size we decide on, the data finally gets written to the computer's HDD, which is formatted with a filesystem block size measured in KBs. Also, while writing to the local FS (not HDFS), it is not guaranteed that the blocks we write are contiguous, so there would be disk seeks anyway. The assumption behind HDFS would only hold if the underlying FS guaranteed to write the data in contiguous blocks.

>Can someone explain a bit?
>Thanks,
>Rahul

While there are no guarantees that disk storage will be contiguous, the OS
will attempt to keep large files contiguous (and may even defragment them
over time), and if all files are written using large blocks this is more
likely to be the case. If storage is contiguous, you can write a complete
track without seeking. Track sizes vary, but a 1 TB disk might have about
500 KB per track. Stepping to an adjacent track is also much cheaper than
an average seek and, as you might expect, has been optimized in hardware to
assist sequential I/O. However, when you switch storage units you will
probably incur at least one full seek at the start of the block (since the
head was probably somewhere else at the time). The result is that, on
average, writing sequential files is very fast (>100 MB/s on typical SATA).
But I think the block-size overhead has more to do with finding where to
read the next block from, assuming the data has been distributed evenly.

So consider connection overhead when the data is distributed. I am no
expert on the Hadoop internals, but I suspect that somewhere a TCP
connection is opened to transfer data. Whether connection overhead is
reduced by maintaining open connection pools, I don't know. But let's
assume there is *some* overhead in switching the data transfer from machine
"A", which owns block "1000", to machine "B", which owns block "1001". The
larger the block size, the less significant this overhead is relative to
the sequential transfer rate.

In addition, MapReduce/YARN has an easier time of scheduling if there are
fewer blocks.
--john
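
Treating the seek and the machine-switch cost together as one fixed cost
per block makes the effect easy to see. A small sketch with assumed figures
(about 100 MB/s sustained transfer and about 15 ms of fixed cost each time
the reader moves on to the next block's DataNode), applied to a 10 GB file:

    // Assumed figures for illustration only, not measurements.
    public class PerBlockCost {
        public static void main(String[] args) {
            double transferMBps = 100.0;   // assumed sustained transfer rate
            double perBlockMs = 15.0;      // assumed seek + connection switch
            long fileMB = 10L * 1024;      // a 10 GB file
            long[] blockMB = {4, 64, 128, 256};
            for (long b : blockMB) {
                long blocks = (fileMB + b - 1) / b;
                double transferSec = fileMB / transferMBps;
                double overheadSec = blocks * perBlockMs / 1000.0;
                double effective = fileMB / (transferSec + overheadSec);
                System.out.printf("block %4d MB: %5d blocks, effective %6.1f MB/s%n",
                        b, blocks, effective);
            }
        }
    }

Under these assumptions, 4 MB blocks drop the effective read rate to
roughly 73 MB/s, while 128 MB blocks keep it around 99 MB/s.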

Re: Why big block size for HDFS.

Posted by Azuryy Yu <az...@gmail.com>.
When you seek to a position within an HDFS file, you do not seek from the
start of the first block and then move forward one block at a time.

The DFSClient skips directly to the block whose offset and length cover
your seek position, and reads from there.





On Mon, Apr 1, 2013 at 12:55 AM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> Hi,
>
> In many places it has been written that HDFS stores big blocks to avoid a
> huge number of disk seeks: once we seek to the start of a block, only the
> data transfer rate is predominant and there are no more seeks. I am not
> sure I have understood this correctly.
>
> My question is: no matter what block size we decide on, the data finally
> gets written to the computer's HDD, which is formatted with a filesystem
> block size measured in KBs. Also, while writing to the local FS (not
> HDFS), it is not guaranteed that the blocks we write are contiguous, so
> there would be disk seeks anyway. The assumption behind HDFS would only
> hold if the underlying FS guaranteed to write the data in contiguous
> blocks.
>
> Can someone explain a bit?
> Thanks,
> Rahul
>
>
>
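
Azuryy's point is visible in the client API: opening an HDFS file gives a
seekable stream, and a seek followed by a read goes straight to the block
covering that offset. A minimal sketch, with a hypothetical path and offset:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SeekRead {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Hypothetical file; with a 128 MB block size this offset
            // falls inside the third block.
            Path path = new Path("/data/big-file.dat");
            long offset = 300L * 1024 * 1024;

            byte[] buf = new byte[4096];
            try (FSDataInputStream in = fs.open(path)) {
                // The client locates the block whose offset range contains
                // this position and reads from a DataNode holding it; it
                // does not read through the preceding blocks.
                in.seek(offset);
                int n = in.read(buf, 0, buf.length);
                System.out.println("read " + n + " bytes at offset " + offset);
            }
            fs.close();
        }
    }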
