Posted to dev@hive.apache.org by "Kamil Bajda-Pawlikowski (JIRA)" <ji...@apache.org> on 2010/02/28 20:18:06 UTC

[jira] Commented: (HIVE-600) Running TPC-H queries on Hive

    [ https://issues.apache.org/jira/browse/HIVE-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839478#action_12839478 ] 

Kamil Bajda-Pawlikowski commented on HIVE-600:
----------------------------------------------

Hi Yuntao,

I have attempted to run TPC-H on Hive. Thanks for the really well-prepared scripts!

During the first query, I realized that things were not going well. It seems that Aaron's concern about the number of reducers was a valid one.
However, the problem is that Hive schedules too many reducers! By default, Hive determines the number of reduce tasks automatically from the value of the "hive.exec.reducers.bytes.per.reducer" property (the default setting is one reduce task per 1GB of input data). When the data is huge, this is inefficient. This needs to be capped!
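
To make the heuristic above concrete, here is a rough sketch in Python. It is not Hive's actual code: the function name, the choice of 10**9 bytes for "1GB", and the exact rounding are my assumptions. Only the two property names and their defaults (1GB per reducer, cap of 999) come from this comment.

```python
def estimate_reducers(input_bytes, bytes_per_reducer=10**9, max_reducers=999):
    """Approximate how many reduce tasks get scheduled for a job.

    bytes_per_reducer stands in for hive.exec.reducers.bytes.per.reducer
    max_reducers      stands in for hive.exec.reducers.max
    (hypothetical helper; the rounding rule is an assumption)
    """
    # Ceiling division: one reducer per started chunk of bytes_per_reducer.
    wanted = -(-input_bytes // bytes_per_reducer)
    # Never fewer than one reducer, never more than the configured cap.
    return max(1, min(max_reducers, wanted))


if __name__ == "__main__":
    fifty_gb = 50 * 10**9
    print(estimate_reducers(fifty_gb))                  # default cap of 999
    print(estimate_reducers(fifty_gb, max_reducers=2))  # cap set to the slot count
```

With the default cap, the reducer count simply follows the input size; lowering the cap to the number of reduce slots is what bounds it.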

For example, in my case there is 50GB of data per node but only 2 reduce task slots, so I'm getting 25 reduce task waves. Q1 ran for 1h49min. In contrast, when I set the "hive.exec.reducers.max" property to the number of reduce slots in my Hadoop installation, the query ran in only about 23min. Of note, the default value for "hive.exec.reducers.max" is 999.

The above issue was not too bad for the data size you used. The TPC-H dataset with SF=100 translates into at most 100 reducers per job, and with 40 reduce slots in total, each job had at most 2.5 reduce task waves. Still, your numbers could be somewhat better by capping "hive.exec.reducers.max" at 40, per Tom White's tip #9 from http://www.cloudera.com/blog/2009/05/10-mapreduce-tips.
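
The wave counts quoted in this comment are just the ratio of reduce tasks to reduce slots. A small Python sketch of that arithmetic follows; the function name is hypothetical, and treating 50GB as 50 reduce tasks assumes the 1GB-per-reducer default mentioned earlier.

```python
def reduce_waves(num_reducers, reduce_slots):
    """Average number of reduce task waves: tasks divided by available slots.

    Hypothetical helper for illustration; a fractional result means the
    last wave only partially fills the slots.
    """
    if reduce_slots <= 0:
        raise ValueError("reduce_slots must be positive")
    return num_reducers / reduce_slots


if __name__ == "__main__":
    # My setup: 50 reduce tasks (50GB at 1GB each) against 2 slots.
    print(reduce_waves(50, 2))    # 25.0
    # SF=100 setup: at most 100 reduce tasks against 40 slots.
    print(reduce_waves(100, 40))  # 2.5
    # With the reducer count capped at the slot count, there is a single wave.
    print(reduce_waves(40, 40))   # 1.0
```

In the Hive CLI the cap itself can be applied per session with a statement such as "set hive.exec.reducers.max=40;", where 40 is the slot count from the example above.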

Could you please confirm whether my understanding is correct?

Thank you,
Kamil





> Running TPC-H queries on Hive
> -----------------------------
>
>                 Key: HIVE-600
>                 URL: https://issues.apache.org/jira/browse/HIVE-600
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Yuntao Jia
>            Assignee: Yuntao Jia
>         Attachments: TPC-H_on_Hive_2009-08-11.pdf, TPC-H_on_Hive_2009-08-11.tar.gz, TPC-H_on_Hive_2009-08-14.tar.gz
>
>
> The goal is to run all TPC-H (http://www.tpc.org/tpch/) benchmark queries on Hive for two reasons. First, through those queries, we would like to find the new features that we need to put into Hive so that Hive supports common SQL queries. Second, we would like to measure the performance of Hive to find out what Hive is not good at. We can then improve Hive based on that information.
> For queries that are not currently supported in Hive, I will try to rewrite them as one or more Hive-supported queries.
