Posted to user@hive.apache.org by Keith Wiley <kw...@keithwiley.com> on 2013/09/30 20:31:33 UTC

Want query to use more reducers

I have a query that doesn't use reducers as efficiently as I would hope.  If I run it on a large table, it uses more reducers, even saturating the cluster, as I desire.  However, on smaller tables it uses as few as a single reducer.  While I understand there is a logic to this (not using multiple reducers until the data size is larger), it is nevertheless inefficient to run a query for thirty minutes while leaving the rest of the cluster idle when the query could distribute the work evenly and wrap things up in a fraction of the time.  The query is shown below (abstracted to its basic form).  As you can see, it is a little atypical: it is a nested query, which implies two map-reduce jobs, and it uses a script for the reducer stage, which is the stage I am trying to speed up.  I thought the "distribute by" clause would make it use the reducers more evenly, but as I said, that is not the behavior I am seeing.

Any ideas how I could improve this situation?

Thanks.

CREATE TABLE output_table ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' as 
SELECT * FROM (
	FROM (
		SELECT * FROM input_table
		DISTRIBUTE BY input_column_1 SORT BY input_column_1 ASC, input_column_2 ASC, input_column_etc ASC) q
	SELECT TRANSFORM(*)
	USING 'python my_reducer_script.py' AS(
	output_column_1,
	output_column_2,
	output_column_etc
	)
) s
ORDER BY output_column_1;

________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"Luminous beings are we, not this crude matter."
                                           --  Yoda
________________________________________________________________________________

Re: Want query to use more reducers

Posted by Keith Wiley <kw...@keithwiley.com>.

Thanks.  mapred.reduce.tasks and hive.exec.reducers.max seem to have fixed the problem.  It is now saturating the cluster and running the query super fast.  Excellent!

On Sep 30, 2013, at 12:28 , Sean Busbey wrote:

> Hey Keith,
> 
> It sounds like you should tweak the settings for how Hive handles query execution[1]:
> 
> 1) Tune the guessed number of reducers based on input size
> 
> = hive.exec.reducers.bytes.per.reducer
> 
> Defaults to 1G. Based on your description, it sounds like this is probably still at default.
> 
> In this case, you should also set a max # of reducers based on your cluster size.
> 
> = hive.exec.reducers.max
> 
> I usually set this to the # reduce slots, if there's a decent chance I'll get to saturate the cluster. If not, don't worry about it.
> 
> 2) Hard code a number of reducers
> 
> = mapred.reduce.tasks
> 
> Setting this will cause Hive to always use that number. It defaults to -1, which tells Hive to use its input-size heuristic to guess.
> 
> In either of the above cases, you should look at the options to merge small files (search for "merge"  in the configuration property list) to avoid getting lots of little outputs.
> 
> HTH
> 
> [1]: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-QueryExecution
> 
> -Sean


Re: Want query to use more reducers

Posted by Sean Busbey <bu...@cloudera.com>.

Hey Keith,

It sounds like you should tweak the settings for how Hive handles query
execution[1]:

1) Tune the guessed number of reducers based on input size

= hive.exec.reducers.bytes.per.reducer

Defaults to 1G. Based on your description, it sounds like this is probably
still at default.

In this case, you should also set a max # of reducers based on your cluster
size.

= hive.exec.reducers.max

I usually set this to the # reduce slots, if there's a decent chance I'll
get to saturate the cluster. If not, don't worry about it.

2) Hard code a number of reducers

= mapred.reduce.tasks

Setting this will cause Hive to always use that number. It defaults to -1,
which tells Hive to use its input-size heuristic to guess.

In either of the above cases, you should look at the options to merge small
files (search for "merge"  in the configuration property list) to avoid
getting lots of little outputs.
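
As a rough sketch of the two approaches above, the session-level settings
might look like the following when issued before the query. The numeric
values are illustrative assumptions (a hypothetical cluster with 32 reduce
slots and an arbitrary 64 MB per-reducer size), not recommendations. The
heuristic is, roughly, input size divided by
hive.exec.reducers.bytes.per.reducer, capped at hive.exec.reducers.max, which
would explain why a table well under 1G ends up with a single reducer.

```sql
-- Option 1: keep the input-size heuristic, but lower the bytes per reducer.
-- 67108864 (64 MB) is an assumed value chosen only for illustration.
SET hive.exec.reducers.bytes.per.reducer=67108864;
-- Cap the estimate at the number of reduce slots (32 is hypothetical).
SET hive.exec.reducers.max=32;

-- Option 2: bypass the heuristic and fix the reducer count.
-- Setting it back to -1 restores the heuristic.
-- SET mapred.reduce.tasks=32;

-- In either case, consider merging small output files at the end of a
-- map-reduce job to avoid lots of little outputs.
SET hive.merge.mapredfiles=true;
```

These settings apply to the current session, so they would be issued before
the CREATE TABLE ... AS SELECT statement. Note that a global ORDER BY is
typically executed by a single reducer regardless of these settings, so the
gain would be expected in the DISTRIBUTE BY / TRANSFORM stage.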

HTH

[1]:
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-QueryExecution

-Sean


-- 
Sean