Posted to user@hadoop.apache.org by KayVajj <va...@gmail.com> on 2013/06/03 06:11:41 UTC

Lots of files in a directory vs files in sub directories

Hi,

I am trying to figure out a partitioning strategy in Hive. I'm considering
either monthly or daily partitions. My usage (querying etc.) points me towards
the daily partition scheme, but I'm not sure what HDFS and NameNode
limitations I might run into.

With daily partitions I would have one 3-4 GB file in each partition, so
after 2 years I would end up with 700-odd directories with one file each.
With monthly partitions, by contrast, I would have 24 directories, each
holding 30 or 31 files of about 4 GB each.
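
To put rough numbers on the two layouts, here is a small Python sketch of the
arithmetic. The 128 MB block size, the figure of about 150 bytes of NameNode
heap per namespace object (a commonly quoted rule of thumb, not something I
have measured), the example dates, and the function name namespace_estimate
are all assumptions made for illustration.

```python
import math
from datetime import date

BLOCK_SIZE_MB = 128      # assumption: HDFS block size; check your cluster's setting
FILE_SIZE_GB = 4         # upper end of the 3-4 GB per day mentioned above
BYTES_PER_OBJECT = 150   # assumption: rough rule of thumb for NameNode heap per object


def namespace_estimate(start, end, scheme):
    """Estimate NameNode objects for the half-open range [start, end).

    Assumes one file per day, and that start and end both fall on the
    first day of a month so the month count is exact.
    """
    days = (end - start).days
    if scheme == "daily":
        dirs = days
    else:  # "monthly"
        dirs = (end.year - start.year) * 12 + (end.month - start.month)
    files = days
    blocks = files * math.ceil(FILE_SIZE_GB * 1024 / BLOCK_SIZE_MB)
    objects = dirs + files + blocks
    return {
        "dirs": dirs,
        "files": files,
        "blocks": blocks,
        "objects": objects,
        "approx_heap_mb": objects * BYTES_PER_OBJECT / (1024 * 1024),
    }


# Hypothetical two-year window
daily = namespace_estimate(date(2012, 1, 1), date(2014, 1, 1), "daily")
monthly = namespace_estimate(date(2012, 1, 1), date(2014, 1, 1), "monthly")
print("daily:  ", daily)
print("monthly:", monthly)
```

Under these assumptions both layouts hold the same 731 files and the same
number of blocks; the only difference is 731 directories versus 24.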

Most of my queries filter on a date range, so I was thinking daily
partitions would be more effective, since a query would not have to scan all
the files for a month as it would with monthly partitions.
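
As a rough illustration of that difference, the Python sketch below counts how
many daily files a date-range query would read under each scheme. It assumes
one file per day, that the query filters on the partition column so only
matching partitions are read, and that a monthly partition is always read in
full. The function name files_scanned and the example dates are hypothetical.

```python
import calendar
from datetime import date


def files_scanned(start, end, scheme):
    """Files read for the inclusive date range [start, end], one file per day."""
    if scheme == "daily":
        # Only the partitions for the requested days are read.
        return (end - start).days + 1
    # Monthly: every file in every month the range touches is read.
    total = 0
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        total += calendar.monthrange(year, month)[1]  # days in this month
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return total


# Hypothetical one-week query that straddles a month boundary
week_daily = files_scanned(date(2013, 5, 28), date(2013, 6, 3), "daily")
week_monthly = files_scanned(date(2013, 5, 28), date(2013, 6, 3), "monthly")
print(week_daily, week_monthly)
```

For this one-week example the daily scheme reads 7 files, while the monthly
scheme reads all 61 files for May and June.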

I would like to know what other considerations I should think about before
making a decision.

1) NameNode / HDFS limitations
2) Archiving files
3) Compression

and maybe more.

I would really appreciate any input on this.

Thanks
Kishore