You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Min Zhou <co...@gmail.com> on 2009/06/01 06:33:22 UTC
Re: inefficient execution plan in this case

Hi,
I feel sorry leading you two focused on join stuffs,  but ineffectives on
union queries like below is the very hot spot I cared instead.

create table tmp( type string, other_cols' definition)

insert overwrite table tmp
select * from(
  select
    'class_a',
    other_cols
  from
    tbl
  where key.contains(a)

  union all
  select
    'class_b',
    other_cols
  from
    tbl
  where key.contains(b)





Thx,
Min
On Wed, May 27, 2009 at 1:53 AM, Namit Jain <nj...@facebook.com> wrote:

>  Try the following::
>
>
>
>
>
> insert overwrite table result_tbl
> select
>   t1.type,
>   t1.count1,
>   t2.count2,
>   t3.count3,
>   t4.count4
> from
>   (select
>     type,
>     distinct statistics_1 as count1
>   from
>     tmp
>   group by
>     type) t1
>
>   join
>   (select
>     type,
>     distinct statistics_2 as count2
>   from
>     tmp
>   group by
>     type) t2
>
>  on t1.type = t2.type
>
>    join
>   (select
>     type,
>     distinct statistics_3 as count3
>   from
>     tmp
>   group by
>     type) t3
>
>  on t1.type =   t3.type
>
>
>   join
>   (select
>     type,
>     distinct statistics_4 as count4
>   from
>     tmp
>   group by
>     type) t4
>
>   on t1.type=t4.type
>
>
>
> It should work and result in 5 map-reduce jobs
>
>
>
>
>
> *From:* Ashish Thusoo [mailto:athusoo@facebook.com]
> *Sent:* Tuesday, May 26, 2009 10:50 AM
> *To:* hive-user@hadoop.apache.org
> *Subject:* RE: inefficient execution plan in this case
>
>
>
> Hi Min,
>
>
>
> Can you give the exact query you tried? Hive does support queries with
> subqueries in the from clause. Namit's query was missing a select clause, so
> if you put that in, it should work...
>
>
>
> Ashish
>
>
>  ------------------------------
>
> *From:* Min Zhou [mailto:coderplay@gmail.com]
> *Sent:* Monday, May 25, 2009 11:59 PM
> *To:* hive-user@hadoop.apache.org
> *Subject:* Re: inefficient execution plan in this case
>
> Thanks Namit.
> I don't think hive can parse this kind of queries you given including
> subqueries in the from clause. You will get message from hive syntax parser
> like below,
> FAILED: Parse Error: line 41:2 mismatched input 't1' expecting EOF
> Even though,  temporary tables can used here helping hive generating 5
> map-reduce jobs. but plans on joins isnot the hot spot I cared.
>
> First query with many unions lead to a map only job. If is executes like
> you explained,  scan table only once, why it cost more than 4 minutes while
> mine whole statistics using raw mapreduce need only 2 minutes, and mappers
> took less than 1 minute?
>
>
>  On Tue, May 26, 2009 at 1:42 PM, Namit Jain <nj...@facebook.com> wrote:
>
> Can you do a explain on the queries ? First query looks OK – 1 map-reduce
> job — The table should be scanned only once and not N times.
>
> The second query will have 7 map-reduce jobs, 4 for group-bys and 3 for
> JOINs.
>
> The way you have written the join query is wrong – it will lead to
> cartesian products of t1 and t2, and then the result of t1 and t2 with t3
> and then the join.
>
>
> Rewrite as:
>
> From t1 join t2 on t1.type=t2.type
> Join t3 on t1.type = t3.type
> Join t4 on t1.type = t4.type
>
> It will lead to 5 map-reduce jobs
>
>
> Predicate push down should fix that – can you file a jira for that, this
> join should be optimized with predicate pushdown.
>
>
>
>
>
>
> On 5/25/09 7:43 PM, "Min Zhou" <co...@gmail.com> wrote:
>
> Hi all,
>
> We had a mapreduce job scaning records and generating perhaps more than 10
> <key,value> pairs each record.  Here is mappers' pseudo-code
> map(key, value, ouput, reporter) {
>    if(key.contains(a)) {
>     output.collect('class_a', value)
>   }
>   if(key.contains(b)) {
>     output.collect('class_b', value)
>   }
>   ...
>   if(key.contains(z)) {
>     output.collect('class_z', value)
>   }
> }
>
> Reducers did some statisticals on the ouput of mappers.  The whole
> mapreduce job paid us only 2 minutes.  But we need more than 10 minutes
> using union all query in hive like below:
>
> create table tmp( type string, other_cols' definition)
>
> insert overwrite table tmp
> select * from(
>   select
>     'class_a',
>     other_cols
>   from
>     tbl
>   where key.contains(a)
>
>   union all
>   select
>     'class_b',
>     other_cols
>   from
>     tbl
>   where key.contains(b)
>
>   ...
>
>   union all
>   select
>     'class_z',
>     other_cols
>   from
>     tbl
>   where key.contains(z)
> ) t;
>
> create table result_tbl(type string, count1 int, count2 int, count3 int,
> count4 int);
>
> insert overwrite table result_tbl
> select
>   t1.type,
>   t1.count1,
>   t2.count2,
>   t3.count3,
>   t4.count4
> from
>   (select
>     type,
>     distinct statistics_1 as count1
>   from
>     tmp
>   group by
>     type) t1
>
>   join
>   (select
>     type,
>     distinct statistics_2 as count2
>   from
>     tmp
>   group by
>     type) t2
>
>   join
>   (select
>     type,
>     distinct statistics_3 as count3
>   from
>     tmp
>   group by
>     type) t3
>
>   join
>   (select
>     type,
>     distinct statistics_4 as count4
>   from
>     tmp
>   group by
>     type) t4
>
>   on t1.type=t2.type
>     and t1.type=t3.type
>     and t1.type=t4.type
>
> First query incuding many unions needed 1 mapreduce jobs, but seemed scan
> table N times, where N is the number of unioned select statements.
> Second query need 7 mapreduce jobs, obviouslly, more time needed.
>
>  Is there anything wrong about us when using Hive? Do you have solution on
> that issue? Thanks in advanced!
>
> Yours,
> Min
>
>
> Regards,
> Min
> --
> My research interests are distributed systems, parallel computing and
> bytecode based virtual machine.
>
> My profile:
> http://www.linkedin.com/in/coderplay
> My blog:
> http://coderplay.javaeye.com
>



-- 
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.

My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com