Posted to user@hive.apache.org by Edward Capriolo <ed...@gmail.com> on 2009/12/23 20:31:47 UTC

Setting mapred.map.tasks and mapred.reduce.tasks per job in query plan

I often tune mapred.map.tasks and mapred.reduce.tasks per Hive query. For example:

set mapred.map.tasks=31;
set mapred.reduce.tasks=11;
FROM Pageviews2 p join client_ip c on c.id = p.clientip_id
insert overwrite directory '/user/ecapriolo/hivetest1'
SELECT ip, count(1)
WHERE (date_id = 6) AND ( p.sitename_id=5 OR p.sitename_id=9 OR
p.sitename_id=13 OR p.sitename_id=17)
GROUP BY ip;


This query operates on 4-5 GB of data. The run time is very
impressive, and the work is accomplished with 3 M/R jobs.

Time taken: 91.64 seconds.

While I am mostly clueless about how the optimizer/query planner
works, I would think that being able to set mapred.map.tasks and
mapred.reduce.tasks as a hint for each MapReduce phase would really
kick up the performance.

For example, this section of the query:

(p.sitename_id=5 OR p.sitename_id=9 OR
p.sitename_id=13 OR p.sitename_id=17)

will prune the result set greatly, so subsequent phases may not need
as many mappers or reducers as the first phase. Again, I have not
looked at reformulating the query, which may aid in optimization, but
is there a place for setting variables per phase?
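
To make the idea concrete, here is a rough, untested sketch of doing
the per-phase tuning by hand: split the work into two statements and
issue different set commands before each one. The staging table
filtered_pageviews, its column type, and the smaller task counts are
made up for illustration, and I am assuming date_id is a column of
Pageviews2. Keep in mind that mapred.map.tasks is only a hint to
Hadoop; the actual number of mappers depends on the input splits.

-- Hypothetical staging table; the column type is an assumption.
CREATE TABLE filtered_pageviews (clientip_id INT);

-- Phase 1: scan and filter the large table, so ask for more tasks.
set mapred.map.tasks=31;
INSERT OVERWRITE TABLE filtered_pageviews
SELECT p.clientip_id
FROM Pageviews2 p
WHERE (p.date_id = 6) AND ( p.sitename_id=5 OR p.sitename_id=9 OR
p.sitename_id=13 OR p.sitename_id=17);

-- Phase 2: join and aggregate the much smaller filtered set.
-- The values 4 and 3 are illustrative only.
set mapred.map.tasks=4;
set mapred.reduce.tasks=3;
FROM filtered_pageviews f join client_ip c on c.id = f.clientip_id
insert overwrite directory '/user/ecapriolo/hivetest1'
SELECT c.ip, count(1)
GROUP BY c.ip;

This costs an extra write of the intermediate table, which is why a
per-phase hint inside a single query plan would be nicer.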

Edward