Posted to user@hive.apache.org by hi...@gmx.de on 2011/01/31 17:08:06 UTC

small files with hive and hadoop

Hello,

I'd like to do reporting with Hive on something like tracking data.
The raw data is about 2 GB or more a day, and querying it with Hive already works for me, no problem.
I also want to cascade the reporting data down to levels like client and date, i.e. something in Hive like partitioned by (client STRING, date STRING).
That means I have multiple aggregation levels, and I'd like to do all levels in Hive so there is one consistent reporting source.
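
For illustration, the aggregated table would look roughly like this (table and column names are invented for the example, and I use dt instead of date to stay clear of reserved words):

    CREATE TABLE report_client_daily (
      metric STRING,
      cnt BIGINT
    )
    PARTITIONED BY (client STRING, dt STRING);

    -- one partition per client and day, filled from the (hypothetical) raw table
    INSERT OVERWRITE TABLE report_client_daily
      PARTITION (client = 'some_client', dt = '2011-01-31')
    SELECT metric, count(1)
    FROM tracking_raw
    WHERE client = 'some_client' AND log_date = '2011-01-31'
    GROUP BY metric;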
And here is the thing: might it become a problem if this produces many small files?
The finest aggregation level, e.g. client/date, might produce files of about 1 MB each, roughly 1000 of them a day.
Is this a problem? I read about the "too many open files" problem with Hadoop. And might this lead to bad Hive/MapReduce performance?
Maybe someone has some clues on that...

Thanks in advance
labtrax

Re: small files with hive and hadoop

Posted by Edward Capriolo <ed...@gmail.com>.

You probably do not want to partition on something that has high
cardinality, such as client_id. Many small partitions are bad for the
NameNode and bad for MapReduce performance. So if you have 1000 client
ids, that is 1000+ files per day, and that is trouble over a long
period of time.

One option is to bucket the table on client_id, into something like
64 buckets. Hive can use the buckets to prune the amount of data that
gets table-scanned for a single client. It is a compromise between
many small files and really large files.
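
A sketch of what I mean (table and column names invented; you also
have to tell Hive to enforce the bucketing when you load the table):

    CREATE TABLE report_daily (
      client_id STRING,
      hits BIGINT
    )
    PARTITIONED BY (dt STRING)
    CLUSTERED BY (client_id) INTO 64 BUCKETS;

    -- hash rows into the 64 declared buckets on insert
    SET hive.enforce.bucketing = true;

A query for a single client then only has to read one of the 64
buckets, e.g. with TABLESAMPLE(BUCKET 1 OUT OF 64 ON client_id),
instead of scanning the whole partition.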

Generally you want big files so Hadoop can use brute-force streaming table scans.
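
And if the small files come out of your own aggregation jobs, Hive
can merge job output into bigger files; roughly like this (exact
names and defaults depend on your Hive version, so treat it as a
sketch):

    -- merge small result files at the end of a job
    SET hive.merge.mapfiles = true;            -- map-only jobs
    SET hive.merge.mapredfiles = true;         -- map-reduce jobs
    SET hive.merge.size.per.task = 256000000;  -- target merged size, in bytes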

Edward

Re: small files with hive and hadoop

Posted by Ajo Fod <aj...@gmail.com>.
I've noticed that it takes a while for each map task to be set up in
Hive, and the way I set up my job, there were as many map tasks as
there were files/buckets.

I read a recommendation somewhere to design jobs such that each task
runs for at least a minute.
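
If the one-map-per-small-file behavior is the bottleneck, letting
Hive combine small files into one split can help; this is roughly
what I would try (property names vary with the Hive/Hadoop version,
so treat it as a sketch):

    -- let a single map task read several small files
    SET hive.input.format = org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
    SET mapred.max.split.size = 256000000;  -- max bytes per combined split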

Cheers,
-Ajo.
