Posted to common-user@hadoop.apache.org by Naveen Mahale <na...@zinniasystems.com> on 2011/09/14 09:52:10 UTC

Handling of small files in hadoop

Hi all,

I use the hadoop-0.21.0 distribution, and I have a large number of small
files (KB-sized). Is there an efficient way of handling them in Hadoop?

I have heard that solutions to this problem include:
            1. HAR (Hadoop Archives) - see the example below
            2. Concatenating (cat) the small files together

I would like to know if there are any other solutions for processing a large
number of small files.
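
For reference, this is how I understand a HAR is created - a minimal sketch
with hypothetical paths; the exact flags vary between versions (the -p
parent option is not present in every release):

    hadoop archive -archiveName files.har -p /user/naveen smallfiles /user/naveen/archives

The archived files can then be listed or read back through the har:// scheme,
e.g. hadoop fs -ls har:///user/naveen/archives/files.har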

Regards,
Naveen Mahale

Re: Handling of small files in hadoop

Posted by Naveen Mahale <na...@zinniasystems.com>.
Hey, thanks Joey for that information. I will work on what you said.

Regards
Naveen Mahale


Re: Handling of small files in hadoop

Posted by Joey Echeverria <jo...@cloudera.com>.
Hi Naveen,

> I use hadoop-0.21.0 distribution. I have a large number of small files (KB).

Word of warning: 0.21 is not a stable release. The recommended versions
are in the 0.20.x range.

> Is there any efficient way of handling it in hadoop?
>
> I have heard that solution for that problem is using:
>            1. HAR (hadoop archives)
>            2. cat on files
>
> I would like to know if there are any other solutions for processing large
> number of small files.

You could also store each file as a record in a SequenceFile: the name of
the file becomes the key, and the bytes of the file become the value. That
gives you compression and splittability, but not random access. You already
noted HAR, which does give you random access.
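
A minimal sketch of that approach, assuming the 0.20-era SequenceFile API
and hypothetical HDFS paths:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SmallFilesToSequenceFile {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path input = new Path("/user/naveen/smallfiles"); // hypothetical input dir
        Path output = new Path("/user/naveen/packed.seq"); // hypothetical output

        // Block compression groups many small records into one compressed block.
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, output, Text.class, BytesWritable.class,
            SequenceFile.CompressionType.BLOCK);
        try {
          for (FileStatus status : fs.listStatus(input)) {
            if (status.isDir()) {
              continue; // skip subdirectories
            }
            byte[] contents = new byte[(int) status.getLen()];
            FSDataInputStream in = fs.open(status.getPath());
            try {
              in.readFully(contents);
            } finally {
              in.close();
            }
            // File name becomes the key, file bytes become the value.
            writer.append(new Text(status.getPath().getName()),
                          new BytesWritable(contents));
          }
        } finally {
          writer.close();
        }
      }
    }

Reading it back is the mirror image with SequenceFile.Reader, and
SequenceFileInputFormat will split the packed file across map tasks.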

-Joey



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434

Re: Handling of small files in hadoop

Posted by Віталій Тимчишин <ti...@gmail.com>.
Note that HAR files are not optimal: the index file is scanned linearly, so
with a lot of files you have to read it in full, and it may be many times
larger than the file you are trying to read.

On 14.09.2011 at 10:53, "Naveen Mahale" <na...@zinniasystems.com> wrote: