Posted to common-user@hadoop.apache.org by Amit Chandel <am...@gmail.com> on 2009/02/09 05:06:23 UTC

using HDFS for a distributed storage system

Hi Group,

I am planning to use HDFS as a reliable, distributed file system for
batch operations. There are no plans as of now to run any MapReduce jobs on top
of it, but in the future we will be running MapReduce operations on files in HDFS.

The current (test) system has 3 machines:
NameNode: dual-core CPU, 2GB RAM, 500GB HDD
2 DataNodes: each with a dual-core CPU, 2GB of RAM, and 1.5TB of
space with an ext3 filesystem.

I just need to put files into and retrieve files from this system. The files I
will put in HDFS vary from a few bytes to around 100MB, with the average
file size being 5MB, and the number of files would grow to around 20-50
million. To avoid hitting the limit on the number of files under a directory, I
store each file at a path derived from the SHA-1 hash of its content (the hash
is 20 bytes long, and I create a 10-level-deep path using 2 bytes for each
level). When I started the cluster a month back, I had set the default
block size to 1MB.
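
For illustration, the path scheme works roughly like this (a minimal sketch;
the "/data" base directory and the hex encoding per level are just how I happen
to picture it, not settled details):

import java.security.MessageDigest;

// Illustrative sketch only: derive a 10-level-deep HDFS path from the SHA-1
// hash of a file's content, using 2 bytes (4 hex characters) per level.
// The "/data" base directory is an assumption, not part of the scheme itself.
public class ContentAddressedPath {

    static String pathFor(byte[] content) throws Exception {
        byte[] sha1 = MessageDigest.getInstance("SHA-1").digest(content); // 20 bytes
        StringBuilder path = new StringBuilder("/data");
        for (int i = 0; i < sha1.length; i++) {
            if (i % 2 == 0) {
                path.append('/');                       // start a new level every 2 bytes
            }
            path.append(String.format("%02x", sha1[i] & 0xff));
        }
        return path.toString();                         // 10 components of 4 hex chars each
    }

    public static void main(String[] args) throws Exception {
        System.out.println(pathFor("example content".getBytes("UTF-8")));
    }
}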

The hardware specs mentioned at
http://wiki.apache.org/hadoop/MachineScaling assume running MapReduce
operations, so I am not sure if my setup is good enough. I would like to get
input on this setup. The final cluster would have 8GB RAM, a quad-core CPU,
and 25TB of attached storage on each DataNode.

I played with this setup a little and then planned to increase the disk
space on both DataNodes. I started by increasing the disk capacity of the
first DataNode to 15TB and changing its underlying filesystem to XFS (which
made it a clean DataNode), and then put it back in the system. Before performing
this operation, I had inserted around 70,000 files into HDFS.
NameNode:50070/dfshealth.jsp
showed "677323 files and directories, 332419 blocks = 1009742 total". I
guess the way I create a 10-level-deep path for each file results in ~10
times the number of actual files in HDFS. Please correct me if I am wrong. I
then ran the rebalancer on the cleaned-up DataNode. It was very slow to begin
with (writing 2 blocks per second, i.e. 2MBps) and died after a few
hours with a "too many open files" error. I checked all the machines: the
DataNode and NameNode processes were running fine on all the respective
machines, but dfshealth.jsp showed both DataNodes as dead.
Restarting the cluster brought both of them up. I guess this has to do with
RAM requirements. My question is how to figure out the RAM requirements of the
DataNode and NameNode in this situation. The documentation states that both the
DataNode and NameNode store the block index, but it's not quite clear whether
the entire index is held in memory. Once I have figured that out, how can I
instruct Hadoop to rebalance with higher priority?
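
For scale, this is the kind of back-of-envelope estimate I am after (using the
roughly-150-bytes-per-file/directory/block-object rule of thumb I have seen
quoted; I am not sure how accurate that figure is):

public class NameNodeHeapEstimate {
    public static void main(String[] args) {
        long filesAndDirectories = 677323L;   // from dfshealth.jsp above
        long blocks = 332419L;                // from dfshealth.jsp above
        long bytesPerObject = 150L;           // rule-of-thumb figure only, not exact
        double estimatedMB = (filesAndDirectories + blocks) * bytesPerObject / (1024.0 * 1024.0);
        System.out.printf("Roughly %.0f MB of NameNode heap for the current namespace%n", estimatedMB);
        // With 20-50 million files, several blocks per file at a 1MB block size, and
        // ~10 directories per file, the same arithmetic quickly reaches many GB of heap.
    }
}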

Another question is regarding the "Non DFS used:" statistic shown on
dfshealth.jsp: is it the space used to store the file and directory
metadata (apart from the actual file content blocks)? Right now
it is about 1/4th of the total space used by HDFS.

Some points which I have thought of over the last month to improve this
model are:
1. I should keep very small files (let's say smaller than 1KB) out of HDFS.
2. Reduce the depth of the directory path created from the SHA-1 hash (instead
of 10 levels, I can keep 3).
3. I should increase the block size to reduce the number of blocks in HDFS
(http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200805.mbox/<4aa34eb70805180030u5de8efaam6f1e9a8832636d42@mail.gmail.com>
says it won't result in wasted disk space); a short sketch of setting a larger
block size follows this list.
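
As an illustration of point 3, the block size can be raised either as a
client-side default via dfs.block.size or per file at create time (a rough
sketch with placeholder paths and a hypothetical NameNode URI):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch with placeholder paths and a hypothetical NameNode URI: raise the block
// size either as a client-side default (dfs.block.size) or per file at create time.
public class LargerBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("dfs.block.size", 64L * 1024 * 1024);   // default for new files from this client
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);

        // Or pick the block size explicitly for a single file:
        FSDataOutputStream out = fs.create(new Path("/data/example.bin"),
                true,                                         // overwrite if it exists
                conf.getInt("io.file.buffer.size", 4096),     // write buffer size
                (short) 2,                                    // replication factor
                64L * 1024 * 1024);                           // block size in bytes
        out.write("payload".getBytes("UTF-8"));
        out.close();
        fs.close();
    }
}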

Any further advice on improvements is appreciated.

Thanks,
Amit

Re: using HDFS for a distributed storage system

Posted by Mark Kerzner <ma...@gmail.com>.
It is a good and useful overview, thank you. It also mentions Stuart Sierra's
post, where Stuart notes that the process is slow. Does anybody know why?

I have written code to copy files from the PC file system to HDFS, and I also
noticed that it is very slow. Instead of 40MB/sec, as suggested by Tom White's
book, it seems to be more like 40 sec/MB. Stuart's tar approach would be about
5 times faster. But still, why is it so slow? Is there a way to speed this up?
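
To be concrete, the copying I am doing looks roughly like this per file (a
minimal sketch with placeholder paths, not my actual code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch with placeholder paths; not the actual code being benchmarked.
public class CopyToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up the cluster config on the classpath
        FileSystem fs = FileSystem.get(conf);
        fs.copyFromLocalFile(new Path("/tmp/local-file.bin"),
                             new Path("/data/local-file.bin"));
        fs.close();
    }
}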

Thanks!

Mark


On Mon, Feb 9, 2009 at 8:35 PM, Jeff Hammerbacher <ha...@cloudera.com> wrote:

> Yo,
>
> I don't want to sound all spammy, but Tom White wrote a pretty nice blog
> post about small files in HDFS recently that you might find helpful. The
> post covers some potential solutions, including Hadoop Archives:
> http://www.cloudera.com/blog/2009/02/02/the-small-files-problem.
>
> Later,
> Jeff

Re: using HDFS for a distributed storage system

Posted by Jeff Hammerbacher <ha...@cloudera.com>.
Yo,

I don't want to sound all spammy, but Tom White wrote a pretty nice blog
post about small files in HDFS recently that you might find helpful. The
post covers some potential solutions, including Hadoop Archives:
http://www.cloudera.com/blog/2009/02/02/the-small-files-problem.

Later,
Jeff

Re: using HDFS for a distributed storage system

Posted by lohit <lo...@yahoo.com>.
> I am planning to add the individual files initially, and after a while (let's
> say 2 days after insertion) will make a SequenceFile out of each directory
> (I am currently looking into SequenceFile) and delete the previous files of
> that directory from HDFS. That way, in the future, I can access any file given
> its directory without much effort.

Have you considered Hadoop archives?
http://hadoop.apache.org/core/docs/current/hadoop_archives.html
Depending on your access pattern, you could store the files in an archive in the first place.

Re: using HDFS for a distributed storage system

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hey Amit,

That plan sounds much better.  I think you will find the system much more
scalable.

From our experience, it takes a while to get the right amount of monitoring
and infrastructure in place to have a very dependable system with 2 replicas.
I would recommend using 3 replicas until you feel you've mastered the setup.

Brian

Re: using HDFS for a distributed storage system

Posted by Amit Chandel <am...@gmail.com>.
Thanks, Brian, for your input.

I am eventually aiming to store 200k directories in this storage system, each
containing about 75 files on average, with the average directory size being
300MB (ranging from 50MB to 650MB).

It will mostly be archival storage, from which I should be able to access
any of the old files easily, but the recent directories would be accessed
frequently for a day or two as they are being added. Directories are added in
batches of 500-1000 per week, and there can be rare bursts of 50k directories
once every 3 months. One such burst is about to come in a month, and I want to
test the current test setup against it. We have upgraded our test
hardware a little from what I last mentioned: the test setup will have 3
DataNodes with 15TB of space, 6GB of RAM, and a dual-core processor each, and a
NameNode with 500GB of storage, 6GB of RAM, and a dual-core processor.

I am planning to add the individual files initially, and after a while (let's
say 2 days after insertion) I will make a SequenceFile out of each directory
(I am currently looking into SequenceFile) and delete the previous files of
that directory from HDFS. That way, in the future, I can access any file given
its directory without much effort.
Now that SequenceFile is in the picture, I can set the default block size to
64MB or even 128MB. For replication, I am replicating each file to just 1 extra
location (i.e. replication factor = 2, since a replication factor of 3 would
leave me with only 33% of the raw capacity as usable storage). Regarding reading
back from HDFS, if I can read at ~50MBps (for recent files), that would be enough.
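
Roughly, the packing step I have in mind would look something like the
following (a rough sketch with placeholder paths and no error handling, not
tested code; keys are the original file names so a file can still be located by
name within its directory's SequenceFile):

import java.io.ByteArrayOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Rough sketch with placeholder paths: pack every file under one HDFS directory
// into a single SequenceFile keyed by file name. Error handling is omitted.
public class PackDirectory {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path dir = new Path("/data/batch-2009-02-09");           // directory to pack
        Path packed = new Path("/packed/batch-2009-02-09.seq");  // resulting SequenceFile

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, packed, Text.class, BytesWritable.class);
        for (FileStatus status : fs.listStatus(dir)) {
            if (status.isDir()) {
                continue;                                         // only pack plain files
            }
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            FSDataInputStream in = fs.open(status.getPath());
            IOUtils.copyBytes(in, buffer, conf, true);            // close=true closes both streams
            writer.append(new Text(status.getPath().getName()),
                          new BytesWritable(buffer.toByteArray()));
        }
        writer.close();
        // Once the SequenceFile has been verified, the original small files can go:
        // fs.delete(dir, true);
    }
}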

Let me know if you see any more pitfalls in this setup, or have more
suggestions. I really appreciate it. Once I test this setup, I will post the
results back to the list.

Thanks,
Amit

Re: using HDFS for a distributed storage system

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hey Amit,

Your current thoughts on keeping the block size larger and removing the
very small files are along the right lines.  Why not choose the default
size of 64MB or larger?  You don't seem too concerned about the number
of replicas.

However, you're still fighting against the tide.  You've got enough
files that you'll be pushing against block report and namenode
limitations, especially with 20-50 million files.  We find that
about 500k blocks per node is a good stopping point right now.

You really, really need to figure out how to organize your files in
such a way that the average file size is above 64MB.  Is there a
"primary key" for each file?  If so, maybe consider HBase?  If you
are just going to be sequentially scanning through all your files,
consider archiving them all into a single sequence file.

Your individual data nodes are quite large ... I hope you're not
expecting to measure throughput in tens of Gbps?

It's hard to give advice without knowing more about your application.
I can tell you that you're going to run into a significant wall if you
can't figure out a way to make your average file size at least
greater than 64MB.

Brian
