You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "Alexander Pivovarov (JIRA)" <ji...@apache.org> on 2015/02/02 02:25:35 UTC
[jira] [Updated] (HIVE-9495) Map Side aggregation affecting map performance

     [ https://issues.apache.org/jira/browse/HIVE-9495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Pivovarov updated HIVE-9495:
--------------------------------------
    Description: 
When trying to run a simple aggregation query with hive.map.aggr=true, map tasks take a lot of time in Hive 0.14 as against  with hive.map.aggr=false.

e.g.
Consider the query:
{code}
INSERT OVERWRITE TABLE lineitem_tgt_agg
select alias.a0 as a0,
 alias.a2 as a1,
 alias.a1 as a2,
 alias.a3 as a3,
 alias.a4 as a4
from (
 select alias.a0 as a0,
  SUM(alias.a1) as a1,
  SUM(alias.a2) as a2,
  SUM(alias.a3) as a3,
  SUM(alias.a4) as a4
 from (
  select lineitem_sf500.l_orderkey as a0,
   CAST(lineitem_sf500.l_quantity * lineitem_sf500.l_extendedprice * (1 - lineitem_sf500.l_discount) * (1 + lineitem_sf500.l_tax) as double) as a1,
   lineitem_sf500.l_quantity as a2,
   CAST(lineitem_sf500.l_quantity * lineitem_sf500.l_extendedprice * lineitem_sf500.l_discount as double) as a3,
   CAST(lineitem_sf500.l_quantity * lineitem_sf500.l_extendedprice * lineitem_sf500.l_tax as double) as a4
  from lineitem_sf500
  ) alias
 group by alias.a0
 ) alias;
{code}
The above query was run with ~376GB of data / ~3billion records in the source.
It takes ~10 minutes with hive.map.aggr=false.
With map side aggregation set to true, the map tasks don't complete even after an hour.




  was:
When trying to run a simple aggregation query with hive.map.aggr=true, map tasks take a lot of time in Hive 0.14 as against  with hive.map.aggr=false.

e.g.
Consider the query:
INSERT OVERWRITE TABLE lineitem_tgt_agg SELECT alias.a0 as a0, alias.a2 as a1, alias.a1 as a2, alias.a3 as a3, alias.a4 as a4 FROM (SELECT alias.a0 as a0, SUM(alias.a1) as a1, SUM(alias.a2) as a2, SUM(alias.a3) as a3, SUM(alias.a4) as a4 FROM (SELECT lineitem_sf500.l_orderkey as a0, CAST(lineitem_sf500.l_quantity * lineitem_sf500.l_extendedprice * (1 - lineitem_sf500.l_discount) * (1 + lineitem_sf500.l_tax) AS DOUBLE) as a1, lineitem_sf500.l_quantity as a2, CAST(lineitem_sf500.l_quantity * lineitem_sf500.l_extendedprice * lineitem_sf500.l_discount AS DOUBLE) as a3, CAST(lineitem_sf500.l_quantity * lineitem_sf500.l_extendedprice * lineitem_sf500.l_tax AS DOUBLE) as a4 FROM lineitem_sf500) alias GROUP BY alias.a0) alias;

The above query was run with ~376GB of data / ~3billion records in the source.
It takes ~10 minutes with hive.map.aggr=false.
With map side aggregation set to true, the map tasks don't complete even after an hour.





> Map Side aggregation affecting map performance
> ----------------------------------------------
>
>                 Key: HIVE-9495
>                 URL: https://issues.apache.org/jira/browse/HIVE-9495
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.14.0
>         Environment: RHEL 6.4
> Hortonworks Hadoop 2.2
>            Reporter: Anand Sridharan
>         Attachments: profiler_screenshot.PNG
>
>
> When trying to run a simple aggregation query with hive.map.aggr=true, map tasks take a lot of time in Hive 0.14 as against  with hive.map.aggr=false.
> e.g.
> Consider the query:
> {code}
> INSERT OVERWRITE TABLE lineitem_tgt_agg
> select alias.a0 as a0,
>  alias.a2 as a1,
>  alias.a1 as a2,
>  alias.a3 as a3,
>  alias.a4 as a4
> from (
>  select alias.a0 as a0,
>   SUM(alias.a1) as a1,
>   SUM(alias.a2) as a2,
>   SUM(alias.a3) as a3,
>   SUM(alias.a4) as a4
>  from (
>   select lineitem_sf500.l_orderkey as a0,
>    CAST(lineitem_sf500.l_quantity * lineitem_sf500.l_extendedprice * (1 - lineitem_sf500.l_discount) * (1 + lineitem_sf500.l_tax) as double) as a1,
>    lineitem_sf500.l_quantity as a2,
>    CAST(lineitem_sf500.l_quantity * lineitem_sf500.l_extendedprice * lineitem_sf500.l_discount as double) as a3,
>    CAST(lineitem_sf500.l_quantity * lineitem_sf500.l_extendedprice * lineitem_sf500.l_tax as double) as a4
>   from lineitem_sf500
>   ) alias
>  group by alias.a0
>  ) alias;
> {code}
> The above query was run with ~376GB of data / ~3billion records in the source.
> It takes ~10 minutes with hive.map.aggr=false.
> With map side aggregation set to true, the map tasks don't complete even after an hour.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)