Posted to user@hadoop.apache.org by Jay Vyas <ja...@gmail.com> on 2012/11/13 21:30:01 UTC

Optimizing Disk I/O - does HDFS do anything ?

How does HDFS deal with optimization of file streaming?  Do data nodes have
any optimizations at the disk level for dealing with fragmented files?  I
assume not, but just curious whether this is at all in the works, or whether
there are Java-level ways of dealing with a long-running set of files in an
HDFS cluster.  Maybe, for example, data nodes could log the amount of time
spent on I/O for certain files as a way of reporting whether or not
defragmentation needs to be run on a particular node in a cluster.
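
(To sketch what I mean, hypothetically, something like the standalone probe
below could walk a DataNode data directory, time a sequential read of each
block file, and flag the slow ones.  This is not anything HDFS does today;
the directory, the blk_ naming filter, and the threshold are just example
assumptions, and the page cache plus concurrent load would skew any real
measurement.)

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

/**
 * Hypothetical probe, not part of HDFS: time a sequential read of each
 * block file under a DataNode data directory and report the slow ones.
 * The default directory and the throughput threshold are made-up examples.
 */
public class BlockReadProbe {
    private static final double MIN_MB_PER_SEC = 40.0; // arbitrary example threshold

    public static void main(String[] args) throws IOException {
        Path dataDir = Paths.get(args.length > 0 ? args[0] : "/data/1/dfs/dn");
        byte[] buf = new byte[1 << 20]; // 1 MiB read buffer

        try (Stream<Path> paths = Files.walk(dataDir)) {
            paths.filter(Files::isRegularFile)
                 .filter(p -> p.getFileName().toString().startsWith("blk_"))
                 .filter(p -> !p.getFileName().toString().endsWith(".meta"))
                 .forEach(p -> {
                     try {
                         long bytes = 0;
                         long start = System.nanoTime();
                         // Read the whole block file front to back, nothing else.
                         try (InputStream in = Files.newInputStream(p)) {
                             int n;
                             while ((n = in.read(buf)) > 0) {
                                 bytes += n;
                             }
                         }
                         double secs = (System.nanoTime() - start) / 1e9;
                         double mbPerSec = (bytes / 1e6) / Math.max(secs, 1e-9);
                         if (mbPerSec < MIN_MB_PER_SEC) {
                             System.out.printf("SLOW %s: %.1f MB/s over %d bytes%n",
                                     p, mbPerSec, bytes);
                         }
                     } catch (IOException e) {
                         System.err.println("skipping " + p + ": " + e.getMessage());
                     }
                 });
        }
    }
}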

-- 
Jay Vyas
http://jayunit100.blogspot.com

Re: Optimizing Disk I/O - does HDFS do anything ?

Posted by Andy Isaacson <ad...@cloudera.com>.
On Tue, Nov 13, 2012 at 1:40 PM, Jay Vyas <ja...@gmail.com> wrote:
> 1) But I thought that this sort of thing (yes, even on Linux) becomes
> important when you have large amounts of data, because the way files are
> written can cause issues on highly packed drives.

If you're running any filesystem at 99% full with a workload that
creates or grows files, the filesystem will experience fragmentation.
Don't do that if you want good performance.

As long as there are a few dozen GB of free space to work with, ext4 on
a modern Linux kernel (2.6.38 or newer) will do a fine job of keeping
files sequential and shouldn't need defrag.

To answer the original question -- HDFS doesn't take any special
measures to enforce defragmentation, but HDFS does follow best
practices to avoid causing fragmentation.

-andy

Re: Optimizing Disk I/O - does HDFS do anything ?

Posted by Jay Vyas <ja...@gmail.com>.
hmmm...

1) But I thought that this sort of thing (yes, even on Linux) becomes
important when you have large amounts of data, because the way files are
written can cause issues on highly packed drives.

2) Probably this is the key point: HDFS I/O is most affected by file
size, which matters much more than occasional minor disk
inhomogeneities.  So the focus is on distributing and replicating files
rather than micro-optimizing individual files.
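
As a rough sketch of what that focus looks like from the client side (this
uses the stock Hadoop FileSystem API; the path below is just an example
value), the replication factor and block layout of a file are what HDFS
exposes and optimizes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Print where each block of an HDFS file lives and how many copies exist.
 * The path is an example; pass a real one as the first argument.
 */
public class ShowBlockLayout {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path(args.length > 0 ? args[0] : "/user/jay/big-input.seq");

        FileStatus status = fs.getFileStatus(file);
        System.out.println("replication factor: " + status.getReplication());

        // One entry per block, with the DataNodes holding a replica of it.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (int i = 0; i < blocks.length; i++) {
            System.out.printf("block %d: offset=%d len=%d hosts=%s%n",
                    i, blocks[i].getOffset(), blocks[i].getLength(),
                    String.join(",", blocks[i].getHosts()));
        }
        fs.close();
    }
}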


On Tue, Nov 13, 2012 at 4:10 PM, Bertrand Dechoux <de...@gmail.com> wrote:

> People are welcome to add more, but I guess the answer is:
> 1) Hadoop is not running on Windows (I am not sure if Microsoft made any
> statement about the OS used for Hadoop on Azure.)
> ->
> http://www.howtogeek.com/115229/htg-explains-why-linux-doesnt-need-defragmenting/
> 2) Files are written in one go with big blocks. (And actually, file
> fragmentation is not the only issue. The many-small-files 'issue' is, in
> the end, a data fragmentation issue too and has an impact on read
> throughput.)
>
> Bertrand Dechoux
>
>
> On Tue, Nov 13, 2012 at 9:30 PM, Jay Vyas <ja...@gmail.com> wrote:
>
>> How does HDFS deal with optimization of file streaming?  Do data nodes
>> have any optimizations at the disk level for dealing with fragmented files?
>> I assume not, but just curious whether this is at all in the works, or whether
>> there are Java-level ways of dealing with a long-running set of files in an
>> HDFS cluster.  Maybe, for example, data nodes could log the amount of time
>> spent on I/O for certain files as a way of reporting whether or not
>> defragmentation needs to be run on a particular node in a cluster.
>>
>> --
>> Jay Vyas
>> http://jayunit100.blogspot.com
>>
>
>


-- 
Jay Vyas
http://jayunit100.blogspot.com

Re: Optimizing Disk I/O - does HDFS do anything ?

Posted by Scott Carey <sc...@richrelevance.com>.
ext3 can be quite atrocious when it comes to fragmentation.  Simply start with an empty drive and have 8 threads each concurrently write to their own large file sequentially.
ext4 is much better in this regard.
xfs is not as good at initial placement, but it has an online defragmenter.
ext4 is fastest on a clean system but can eventually get somewhat fragmented, and it has no defragmentation option.
xfs is slow at metadata operations, and I would avoid it for M/R temp for that reason.


I use ext4 for M/R temp, and xfs plus the online defragmenter for HDFS.  The defragmenter runs nightly and has little work to do if run regularly.
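
For anyone who wants to reproduce that, here is a minimal standalone sketch
of the workload described above (the directory and sizes are example values):
several threads each write their own large file sequentially at the same
time, and a tool such as filefrag will then show how many extents each file
ended up in.

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/**
 * Concurrent sequential writers: the pattern that fragments ext3 badly.
 * Point it at a scratch mount; adjust writers and file size to taste.
 */
public class ConcurrentWriters {
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get(args.length > 0 ? args[0] : "/mnt/scratch/fragtest");
        Files.createDirectories(dir);
        int writers = 8;
        long bytesPerFile = 2L * 1024 * 1024 * 1024; // 2 GiB per file
        byte[] chunk = new byte[1 << 20];            // 1 MiB appends

        Thread[] threads = new Thread[writers];
        for (int i = 0; i < writers; i++) {
            Path file = dir.resolve("writer-" + i + ".dat");
            threads[i] = new Thread(() -> {
                try (OutputStream out = Files.newOutputStream(file)) {
                    long written = 0;
                    while (written < bytesPerFile) {
                        out.write(chunk);        // purely sequential appends
                        written += chunk.length;
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
    }
}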



On 11/13/12 1:10 PM, "Bertrand Dechoux" <de...@gmail.com> wrote:

People are welcome to add more, but I guess the answer is:
1) Hadoop is not running on Windows (I am not sure if Microsoft made any statement about the OS used for Hadoop on Azure.)
-> http://www.howtogeek.com/115229/htg-explains-why-linux-doesnt-need-defragmenting/
2) Files are written in one go with big blocks. (And actually, file fragmentation is not the only issue. The many-small-files 'issue' is, in the end, a data fragmentation issue too and has an impact on read throughput.)

Bertrand Dechoux

On Tue, Nov 13, 2012 at 9:30 PM, Jay Vyas <ja...@gmail.com> wrote:
How does HDFS deal with optimization of file streaming?  Do data nodes have any optimizations at the disk level for dealing with fragmented files?  I assume not, but just curious whether this is at all in the works, or whether there are Java-level ways of dealing with a long-running set of files in an HDFS cluster.  Maybe, for example, data nodes could log the amount of time spent on I/O for certain files as a way of reporting whether or not defragmentation needs to be run on a particular node in a cluster.

--
Jay Vyas
http://jayunit100.blogspot.com


Re: Optimizing Disk I/O - does HDFS do anything ?

Posted by Bertrand Dechoux <de...@gmail.com>.
People are welcome to add more, but I guess the answer is:
1) Hadoop is not running on Windows (I am not sure if Microsoft made any
statement about the OS used for Hadoop on Azure.)
->
http://www.howtogeek.com/115229/htg-explains-why-linux-doesnt-need-defragmenting/
2) Files are written in one go with big blocks. (And actually, file
fragmentation is not the only issue. The many-small-files 'issue' is, in
the end, a data fragmentation issue too and has an impact on read
throughput.)
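
As a rough sketch of what "written in one go" means from the client side
(this uses the stock Hadoop FileSystem API; the paths are just example
values), a single streaming create() pushes the whole file through, block
after block, with no seeking back:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;

/**
 * Stream a local file into HDFS in one pass. Each HDFS block arrives at a
 * DataNode as one long sequential write. Paths are example values.
 */
public class StreamIntoHdfs {
    public static void main(String[] args) throws Exception {
        String localSrc = args.length > 0 ? args[0] : "/tmp/input.log";
        String hdfsDst  = args.length > 1 ? args[1] : "/user/jay/input.log";

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        try (InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
             OutputStream out = fs.create(new Path(hdfsDst))) {
            // One pass, no seeks: data flows straight into the current block
            // until it is full, then the next block is started.
            IOUtils.copyBytes(in, out, 4096, false);
        }
        fs.close();
    }
}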

Bertrand Dechoux

On Tue, Nov 13, 2012 at 9:30 PM, Jay Vyas <ja...@gmail.com> wrote:

> How does HDFS deal with optimization of file streaming?  Do data nodes
> have any optimizations at the disk level for dealing with fragmented files?
> I assume not, but just curious whether this is at all in the works, or whether
> there are Java-level ways of dealing with a long-running set of files in an
> HDFS cluster.  Maybe, for example, data nodes could log the amount of time
> spent on I/O for certain files as a way of reporting whether or not
> defragmentation needs to be run on a particular node in a cluster.
>
> --
> Jay Vyas
> http://jayunit100.blogspot.com
>
