Posted to user@hive.apache.org by Andraz Tori <an...@zemanta.com> on 2009/07/29 10:41:50 UTC

Multiple input files for a single map task?

Is there a way to tell Hive to take multiple input files as input for a
single map task?

Task setup time is so high in Hive/Hadoop that it really degrades
performance when there are many smaller files (in the 10 MB range). But
there's no reason why 10 different small files shouldn't be sent to the
same map task, so the question is: does Hive support this scenario?

If yes, how do I set it up?


-- 
Andraz Tori, CTO
Zemanta Ltd, New York, London, Ljubljana
www.zemanta.com
mail: andraz@zemanta.com
tel: +386 41 515 767
twitter: andraz, skype: minmax_test

Re: Multiple input files for a single map task?

Posted by Zheng Shao <zs...@gmail.com>.

Hi Andraz,

It's not supported right now.
It will be supported when Hive moves to Hadoop 0.20 (with multi-file
inputformat).

An alternative is to merge these smaller files via a query like this:

set hive.exec.reducers.bytes.per.reducer=1000000000;
INSERT OVERWRITE TABLE t
SELECT * FROM t
DISTRIBUTE BY rand();

This will redistribute the data into a smaller number of files (each
file will be around 1 GB).
You can change that number to get a different file size.
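
As a rough sketch of how the file size follows from that setting: Hive
estimates the reducer count as roughly the total input size divided by
hive.exec.reducers.bytes.per.reducer, and each reducer writes one output
file. The 10 GB table size and the 256 MB target below are assumed
figures for illustration only, and the count is also capped by
hive.exec.reducers.max, so treat the numbers as approximate.

```sql
-- Assumption: table t holds about 10 GB in many ~10 MB files.
-- Target about 256 MB per output file instead of 1 GB.
-- Estimated reducers: 10,000,000,000 / 256,000,000, i.e. roughly 40,
-- so roughly 40 output files of about 250 MB each.
set hive.exec.reducers.bytes.per.reducer=256000000;

-- DISTRIBUTE BY rand() spreads rows evenly across the reducers,
-- so the resulting files come out at similar sizes.
INSERT OVERWRITE TABLE t
SELECT * FROM t
DISTRIBUTE BY rand();
```

A smaller value gives more, smaller files; a larger value gives fewer,
larger ones.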

Zheng




-- 
Yours,
Zheng