Posted to common-user@hadoop.apache.org by Steve Sapovits <ss...@invitemedia.com> on 2008/02/22 04:38:37 UTC

file/directory sizes

I'm looking for any information on best-practice Hadoop configurations, in terms of
the number of files, the number of files per directory, and file sizes (e.g., are
lots of small files more of a problem than fewer, larger ones?).

Any pointers to documentation or feedback from experience would be appreciated.

-- 
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
ssapovits@invitemedia.com


Re: file/directory sizes

Posted by Ted Dunning <td...@veoh.com>.
It is definitely better to combine files into larger ones, if only to make
sure that you use sequential reads as much as possible.
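A common way to do this (a minimal sketch, not something from this thread; the
class name, output path, and key/value choices are illustrative assumptions) is
to pack the small files into a single SequenceFile, keyed by the original file
name:

    import java.io.DataInputStream;
    import java.io.File;
    import java.io.FileInputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Hypothetical sketch: pack many small local files into one SequenceFile
    // so HDFS stores a single large file instead of thousands of tiny ones.
    public class SmallFilePacker {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path out = new Path("/user/steve/packed.seq"); // illustrative path

            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, out, Text.class, BytesWritable.class);
            try {
                for (String name : args) { // each argument is one small local file
                    File f = new File(name);
                    byte[] data = new byte[(int) f.length()];
                    DataInputStream in = new DataInputStream(new FileInputStream(f));
                    try {
                        in.readFully(data);
                    } finally {
                        in.close();
                    }
                    // key = original file name, value = raw file contents
                    writer.append(new Text(name), new BytesWritable(data));
                }
            } finally {
                writer.close();
            }
        }
    }

A map-reduce job can then read the packed file with SequenceFileInputFormat and
get large sequential reads instead of one seek per small file.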


On 2/21/08 9:48 PM, "Steve Sapovits" <ss...@invitemedia.com> wrote:

> Amar Kamat wrote:
> 
>> File sizes and number of files (assuming that's what you want to tweak)
>> are not much of a concern for map-reduce. What ultimately matters is the
>> DFS block size and the split size. The basic unit of replication in DFS is
>> the block, while the basic processing unit for map-reduce is the split.
>> Other parameters don't matter much if you control the block size
>> (dfs.block.size) and the split size (mapred.min.split.size).
> 
> What about the write side? Someone indicated to me that HDFS wasn't
> very good at storing lots of small files -- that it would be better to
> somehow combine things into larger files.


Re: file/directory sizes

Posted by Steve Sapovits <ss...@invitemedia.com>.
Amar Kamat wrote:

> File sizes and number of files (assuming that's what you want to tweak)
> are not much of a concern for map-reduce. What ultimately matters is the
> DFS block size and the split size. The basic unit of replication in DFS is
> the block, while the basic processing unit for map-reduce is the split.
> Other parameters don't matter much if you control the block size
> (dfs.block.size) and the split size (mapred.min.split.size).

What about the write side? Someone indicated to me that HDFS wasn't
very good at storing lots of small files -- that it would be better to
somehow combine things into larger files.

-- 
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
ssapovits@invitemedia.com


Re: file/directory sizes

Posted by Amar Kamat <am...@yahoo-inc.com>.
File sizes and number of files (assuming that's what you want to tweak)
are not much of a concern for map-reduce. What ultimately matters is the
DFS block size and the split size. The basic unit of replication in DFS is
the block, while the basic processing unit for map-reduce is the split.
Other parameters don't matter much if you control the block size
(dfs.block.size) and the split size (mapred.min.split.size).
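As a minimal sketch of what setting both knobs looks like in client code
(assuming the old-style org.apache.hadoop.mapred API; the 128 MB values are
illustrative, not a recommendation):

    import org.apache.hadoop.mapred.JobConf;

    public class SplitSizeExample {
        public static void main(String[] args) {
            JobConf job = new JobConf();

            // Block size is chosen by the writing client, per file;
            // it only affects files created with this configuration.
            job.setLong("dfs.block.size", 128L * 1024 * 1024);

            // Lower bound on the map-reduce split size for this job.
            job.setLong("mapred.min.split.size", 128L * 1024 * 1024);
        }
    }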
Amar
Steve Sapovits wrote:
>
> I'm looking for any information on best-practice Hadoop configurations, in terms of
> the number of files, the number of files per directory, and file sizes (e.g., are
> lots of small files more of a problem than fewer, larger ones?).
>
> Any pointers to documentation or feedback from experience would be appreciated.
>