You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Min Zhou <co...@gmail.com> on 2009/06/01 06:33:22 UTC
Re: inefficient execution plan in this case
Hi,
I feel sorry leading you two focused on join stuffs, but ineffectives on
union queries like below is the very hot spot I cared instead.
create table tmp( type string, other_cols' definition)
insert overwrite table tmp
select * from(
select
'class_a',
other_cols
from
tbl
where key.contains(a)
union all
select
'class_b',
other_cols
from
tbl
where key.contains(b)
Thx,
Min
On Wed, May 27, 2009 at 1:53 AM, Namit Jain <nj...@facebook.com> wrote:
> Try the following::
>
>
>
>
>
> insert overwrite table result_tbl
> select
> t1.type,
> t1.count1,
> t2.count2,
> t3.count3,
> t4.count4
> from
> (select
> type,
> distinct statistics_1 as count1
> from
> tmp
> group by
> type) t1
>
> join
> (select
> type,
> distinct statistics_2 as count2
> from
> tmp
> group by
> type) t2
>
> on t1.type = t2.type
>
> join
> (select
> type,
> distinct statistics_3 as count3
> from
> tmp
> group by
> type) t3
>
> on t1.type = t3.type
>
>
> join
> (select
> type,
> distinct statistics_4 as count4
> from
> tmp
> group by
> type) t4
>
> on t1.type=t4.type
>
>
>
> It should work and result in 5 map-reduce jobs
>
>
>
>
>
> *From:* Ashish Thusoo [mailto:athusoo@facebook.com]
> *Sent:* Tuesday, May 26, 2009 10:50 AM
> *To:* hive-user@hadoop.apache.org
> *Subject:* RE: inefficient execution plan in this case
>
>
>
> Hi Min,
>
>
>
> Can you give the exact query you tried? Hive does support queries with
> subqueries in the from clause. Namit's query was missing a select clause, so
> if you put that in, it should work...
>
>
>
> Ashish
>
>
> ------------------------------
>
> *From:* Min Zhou [mailto:coderplay@gmail.com]
> *Sent:* Monday, May 25, 2009 11:59 PM
> *To:* hive-user@hadoop.apache.org
> *Subject:* Re: inefficient execution plan in this case
>
> Thanks Namit.
> I don't think hive can parse this kind of queries you given including
> subqueries in the from clause. You will get message from hive syntax parser
> like below,
> FAILED: Parse Error: line 41:2 mismatched input 't1' expecting EOF
> Even though, temporary tables can used here helping hive generating 5
> map-reduce jobs. but plans on joins isnot the hot spot I cared.
>
> First query with many unions lead to a map only job. If is executes like
> you explained, scan table only once, why it cost more than 4 minutes while
> mine whole statistics using raw mapreduce need only 2 minutes, and mappers
> took less than 1 minute?
>
>
> On Tue, May 26, 2009 at 1:42 PM, Namit Jain <nj...@facebook.com> wrote:
>
> Can you do a explain on the queries ? First query looks OK – 1 map-reduce
> job — The table should be scanned only once and not N times.
>
> The second query will have 7 map-reduce jobs, 4 for group-bys and 3 for
> JOINs.
>
> The way you have written the join query is wrong – it will lead to
> cartesian products of t1 and t2, and then the result of t1 and t2 with t3
> and then the join.
>
>
> Rewrite as:
>
> From t1 join t2 on t1.type=t2.type
> Join t3 on t1.type = t3.type
> Join t4 on t1.type = t4.type
>
> It will lead to 5 map-reduce jobs
>
>
> Predicate push down should fix that – can you file a jira for that, this
> join should be optimized with predicate pushdown.
>
>
>
>
>
>
> On 5/25/09 7:43 PM, "Min Zhou" <co...@gmail.com> wrote:
>
> Hi all,
>
> We had a mapreduce job scaning records and generating perhaps more than 10
> <key,value> pairs each record. Here is mappers' pseudo-code
> map(key, value, ouput, reporter) {
> if(key.contains(a)) {
> output.collect('class_a', value)
> }
> if(key.contains(b)) {
> output.collect('class_b', value)
> }
> ...
> if(key.contains(z)) {
> output.collect('class_z', value)
> }
> }
>
> Reducers did some statisticals on the ouput of mappers. The whole
> mapreduce job paid us only 2 minutes. But we need more than 10 minutes
> using union all query in hive like below:
>
> create table tmp( type string, other_cols' definition)
>
> insert overwrite table tmp
> select * from(
> select
> 'class_a',
> other_cols
> from
> tbl
> where key.contains(a)
>
> union all
> select
> 'class_b',
> other_cols
> from
> tbl
> where key.contains(b)
>
> ...
>
> union all
> select
> 'class_z',
> other_cols
> from
> tbl
> where key.contains(z)
> ) t;
>
> create table result_tbl(type string, count1 int, count2 int, count3 int,
> count4 int);
>
> insert overwrite table result_tbl
> select
> t1.type,
> t1.count1,
> t2.count2,
> t3.count3,
> t4.count4
> from
> (select
> type,
> distinct statistics_1 as count1
> from
> tmp
> group by
> type) t1
>
> join
> (select
> type,
> distinct statistics_2 as count2
> from
> tmp
> group by
> type) t2
>
> join
> (select
> type,
> distinct statistics_3 as count3
> from
> tmp
> group by
> type) t3
>
> join
> (select
> type,
> distinct statistics_4 as count4
> from
> tmp
> group by
> type) t4
>
> on t1.type=t2.type
> and t1.type=t3.type
> and t1.type=t4.type
>
> First query incuding many unions needed 1 mapreduce jobs, but seemed scan
> table N times, where N is the number of unioned select statements.
> Second query need 7 mapreduce jobs, obviouslly, more time needed.
>
> Is there anything wrong about us when using Hive? Do you have solution on
> that issue? Thanks in advanced!
>
> Yours,
> Min
>
>
> Regards,
> Min
> --
> My research interests are distributed systems, parallel computing and
> bytecode based virtual machine.
>
> My profile:
> http://www.linkedin.com/in/coderplay
> My blog:
> http://coderplay.javaeye.com
>
--
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.
My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com