Posted to hdfs-user@hadoop.apache.org by Stuart Smith <st...@yahoo.com> on 2010/08/18 02:44:53 UTC

Maximum number of files in directory? (in hdfs)

Hello,
  I'm looking at storing a large number of files under one directory. 

I started to break the files into subdirectories out of habit (from working on ntfs, etc.), but it occurred to me that maybe, from a performance perspective, it doesn't really matter on hdfs.

Does it? Is there some recommended limit on the number of files to store in one directory on hdfs? I'm thinking thousands to millions, so we're not talking about INT_MAX or anything, but a lot.

Or is it only limited by my sanity :) ?

I suppose it would come down to the data structure(s) used by the namenode when tracking file metadata. But I don't know what those are - I did skim the HDFS architecture document, but didn't see anything conclusive.

Take care,
  -stu

Re: Maximum number of files in directory? (in hdfs)

Posted by st...@yahoo.com.
Thanks!
I'll go with keeping my sanity then.

The files will all be >= 64MB

Take care,
 -stu 

Re: Maximum number of files in directory? (in hdfs)

Posted by Allen Wittenauer <aw...@linkedin.com>.
On Aug 17, 2010, at 5:44 PM, Stuart Smith wrote:
> I started to break the files into subdirectories out of habit (from working on ntfs, etc.), but it occurred to me that maybe, from a performance perspective, it doesn't really matter on hdfs.
> 
> Does it? Is there some recommended limit on the number of files to store in one directory on hdfs? I'm thinking thousands to millions, so we're not talking about INT_MAX or anything, but a lot.
> 
> Or is it only limited by my sanity :) ?

We have a directory with several thousand files in it.

It is always a pain when we hit it, because the client heap size needs to be increased to do anything in it: directory listings, web UIs, distcp, etc. Any sort of manipulation in that directory is also slower.
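
To make the heap issue concrete: FileSystem.listStatus() materializes a status object for every child of the directory on the client before it returns, so a huge directory means a huge client heap. A minimal sketch, assuming a Hadoop client on the classpath; the directory path is hypothetical, and the iterator-based listFiles() call is only available in later Hadoop releases:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class BigDirListing {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path("/data/bigdir"); // hypothetical directory

        // Eager listing: the entire FileStatus[] for the directory
        // must fit in the client heap before this call returns.
        FileStatus[] all = fs.listStatus(dir);
        System.out.println("children: " + all.length);

        // Incremental listing (later Hadoop releases): entries are
        // fetched from the namenode in batches as the iterator is
        // consumed, so memory use stays bounded.
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(dir, false);
        while (it.hasNext()) {
            System.out.println(it.next().getPath());
        }
    }
}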

My recommendation: don't do it.  Directories, AFAIK, are relatively cheap resource-wise compared to lots of files in one.
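
One common way to follow that advice is to fan files out across a fixed set of hash buckets so that no single directory grows without bound. A sketch under those assumptions; the base path and the 256-way fan-out below are hypothetical choices, not anything HDFS requires:

import org.apache.hadoop.fs.Path;

public class Buckets {
    // Map a file name to /data/files/<xx>/<name>, where <xx> is one of
    // 256 buckets derived from the name's hash. Purely illustrative.
    static Path bucketFor(String name) {
        int bucket = (name.hashCode() & 0x7fffffff) % 256;
        return new Path(String.format("/data/files/%02x/%s", bucket, name));
    }

    public static void main(String[] args) {
        // Prints something like /data/files/4d/part-00042.bin
        System.out.println(bucketFor("part-00042.bin"));
    }
}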

[Hopefully these files are large.  Otherwise they should be joined together... if they aren't, you're going to take a performance hit processing them *and* storing them...]
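
For the small-file case, one standard remedy is to pack many little files into a single SequenceFile keyed by file name (Hadoop archives are another option). A rough sketch, assuming the classic SequenceFile.createWriter() API; the input and output paths are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

import java.io.ByteArrayOutputStream;
import java.io.InputStream;

public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path in = new Path("/data/small");       // hypothetical input dir
        Path out = new Path("/data/packed.seq"); // hypothetical output

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, Text.class, BytesWritable.class);
        try {
            for (FileStatus st : fs.listStatus(in)) {
                if (st.isDir()) continue;
                // Files are assumed small, so buffering one whole file
                // in memory at a time is acceptable here.
                InputStream is = fs.open(st.getPath());
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                IOUtils.copyBytes(is, buf, conf, true); // closes the stream
                writer.append(new Text(st.getPath().getName()),
                              new BytesWritable(buf.toByteArray()));
            }
        } finally {
            writer.close();
        }
    }
}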