Posted to user@hive.apache.org by Min Zhou <co...@gmail.com> on 2009/08/04 05:02:28 UTC

why insert overwrite table tmp partition(dt=1) select bar, foo from pokes NEEDS 2 MR JOBS?

I thought a single map-only job would be enough. Try:
hive> explain insert overwrite table tmp partition(dt=1) select bar, foo
from pokes;


Thanks,
Min
-- 
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.

My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com

Re: why insert overwrite table tmp partition(dt=1) select bar, foo from pokes NEEDS 2 MR JOBS?

Posted by Zheng Shao <zs...@gmail.com>.

Hi Min,

We recently added a capability to Hive to merge small output files.

You can do the following to disable that feature:
set hive.merge.mapfiles=false;


OR you can adjust the following parameter to determine when the
additional merge job should run:
set hive.merge.size.per.task=256000000;

By default it's 256MB, which means that if the average output of a mapper
is smaller than 256MB, an additional merge job will run.
You can lower that number to something like 64MB if you want.
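
As a rough sketch of the second option, the session below keeps merging
enabled but lowers the threshold to about 64MB before running the insert
from the original question. The value 64000000 is an assumption that
follows the same decimal byte convention as the 256000000 default shown
above; whether the extra job actually runs still depends on the average
mapper output size for your data.

```sql
-- Keep the small-file merge feature on (it is on by default).
set hive.merge.mapfiles=true;

-- Assumed value: roughly 64MB, written in bytes like the 256000000 default.
-- The merge job should now only run when the average mapper output
-- is smaller than this threshold.
set hive.merge.size.per.task=64000000;

insert overwrite table tmp partition(dt=1) select bar, foo from pokes;
```

If the comment lines cause trouble in an interactive session, they can be
dropped; only the two set commands and the insert matter.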

Zheng

On Mon, Aug 3, 2009 at 8:02 PM, Min Zhou<co...@gmail.com> wrote:
> I thought a single map-only job would be enough. Try:
> hive> explain insert overwrite table tmp partition(dt=1) select bar, foo
> from pokes;
>
>
> Thanks,
> Min



-- 
Yours,
Zheng

Re: why insert overwrite table tmp partition(dt=1) select bar, foo from pokes NEEDS 2 MR JOBS?

Posted by Min Zhou <co...@gmail.com>.

Got it. Thanks a lot, Zheng and Ashish!

Min

On Thu, Aug 6, 2009 at 2:59 AM, Ashish Thusoo <at...@facebook.com> wrote:

> Not sure if this got answered. The second MR job in this case concatenates
> the map outputs so that far fewer files are generated than there are
> mappers. This benefits downstream jobs that consume the data. This feature
> was added recently. You can, however, turn it off using the following
> configuration variable.
>
> hive.merge.mapfiles=false
>
> This is true by default.
>
> Ashish
>  ------------------------------
> *From:* Min Zhou [mailto:coderplay@gmail.com]
> *Sent:* Monday, August 03, 2009 8:02 PM
> *To:* hive-user
> *Subject:* why insert overwrite table tmp partition(dt=1) select bar, foo
> from pokes NEEDS 2 MR JOBS?
>
> I thought a single map-only job would be enough. Try:
> hive> explain insert overwrite table tmp partition(dt=1) select bar, foo
> from pokes;
>
>
> Thanks,
> Min




RE: why insert overwrite table tmp partition(dt=1) select bar, foo from pokes NEEDS 2 MR JOBS?

Posted by Ashish Thusoo <at...@facebook.com>.

Not sure if this got answered. The second MR job in this case concatenates the map outputs so that far fewer files are generated than there are mappers. This benefits downstream jobs that consume the data. This feature was added recently. You can, however, turn it off using the following configuration variable.

hive.merge.mapfiles=false

This is true by default.
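
One way to check the effect, sketched here under the assumption that the setting is applied in the same session as the query: turn the variable off and re-run the EXPLAIN from the original question, then compare the list of stages with the earlier plan. The exact plan text varies between Hive versions, so no output is shown.

```sql
-- Disable the post-job merge of small map output files for this session.
set hive.merge.mapfiles=false;

-- Re-run EXPLAIN on the same statement and compare the stage list
-- against the plan produced with the default setting.
explain insert overwrite table tmp partition(dt=1) select bar, foo from pokes;
```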

Ashish
________________________________
From: Min Zhou [mailto:coderplay@gmail.com]
Sent: Monday, August 03, 2009 8:02 PM
To: hive-user
Subject: why insert overwrite table tmp partition(dt=1) select bar, foo from pokes NEEDS 2 MR JOBS?

I thought a single map-only job would be enough. Try:
hive> explain insert overwrite table tmp partition(dt=1) select bar, foo from pokes;


Thanks,
Min