Posted to user@hive.apache.org by Keith Wiley <kw...@keithwiley.com> on 2013/10/02 20:48:14 UTC

Use distribute to spread across reducers

I'm trying to create a subset of a large table for testing.  The following approach works:

create table subset_table as
select * from large_table limit 1000

...but it only uses one reducer.  I would like to speed up the process of creating a subset by distributing the work across multiple reducers.  I already tried explicitly setting mapred.reduce.tasks and hive.exec.reducers.max to values larger than 1, but in this particular case, those values seem to be overridden by Hive's internal query-to-MapReduce conversion; it ignores those parameters.

So, I tried this:

create table subset_table as
select * from large_table limit 1000
distribute by column_name

...but that doesn't parse.  I get the following error:

OK FAILED: ParseException line 3:0 missing EOF at 'distribute' near '1000'.

I have tried NUMEROUS applications of parentheses, nested queries, etc.  For example, here's just one (amongst perhaps ten variations on a theme):

create table subset_table as
select * from (
from (
select * from large_table limit 1000
distribute by column_name
)) s

Like I said, I've tried all sorts of combinations of the elements shown above.  So far I have not even gotten any syntax to parse, much less run.  Only the original query at the top will even pass the parsing stage of processing.

Any ideas?

Thanks.

________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use."
                                           --  Galileo Galilei
________________________________________________________________________________
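
One possible reading of the ParseException above: in Hive's documented SELECT syntax, DISTRIBUTE BY is listed before LIMIT, so a DISTRIBUTE BY placed after LIMIT 1000 is rejected by the parser. The statements below are an untested sketch of orderings that should parse, reusing the table and column names from the post. The subquery alias s follows the original attempt. Nothing here has been run against a real cluster.

```sql
-- Variant 1: DISTRIBUTE BY placed before LIMIT, as in Hive's SELECT grammar.
CREATE TABLE subset_table AS
SELECT * FROM large_table
DISTRIBUTE BY column_name
LIMIT 1000;

-- Variant 2: take the 1000 rows in an aliased subquery, then distribute
-- the result in the outer query.
CREATE TABLE subset_table AS
SELECT s.* FROM (
  SELECT * FROM large_table LIMIT 1000
) s
DISTRIBUTE BY column_name;
```

Getting the statement to parse is a separate matter from getting more reducers. A global LIMIT may still be applied in a single-reducer stage, so neither variant is guaranteed to be faster than the original query.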


Re: Use distribute to spread across reducers

Posted by Timothy Potter <th...@gmail.com>.
Hi Keith,

Have you tried the TABLESAMPLE command?
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling

Tim
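
For readers following the link, here is a rough sketch of what a TABLESAMPLE-based subset could look like. It assumes the table names from the original post; the bucket count of 1000 is an arbitrary illustration. Unlike LIMIT 1000, sampling returns an approximate fraction of the rows rather than an exact row count.

```sql
-- Bucket sampling on rand(): keeps roughly 1 row in 1000, chosen at random.
-- Because large_table is not assumed to be bucketed on rand(), this still
-- scans the whole table, but the filtering happens in the map phase.
CREATE TABLE subset_table AS
SELECT * FROM large_table TABLESAMPLE(BUCKET 1 OUT OF 1000 ON rand()) s;
```

The linked page also describes block sampling, such as TABLESAMPLE(0.1 PERCENT). Whether that form is available depends on the Hive version and input format in use.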


On Thu, Oct 3, 2013 at 11:58 AM, Yin Huai <hu...@gmail.com> wrote:

> Hello Keith,
>
> Hive will not launch a MR job for your query because it basically reads
> all columns from a table. Hive will fetch the data for you directly from
> the underlying filesystem.
>
> Thanks,
>
> Yin

Re: Use distribute to spread across reducers

Posted by Yin Huai <hu...@gmail.com>.
Hello Keith,

Hive will not launch a MapReduce job for your query because it simply reads
all columns from a table. Hive will fetch the data for you directly from the
underlying filesystem.

Thanks,

Yin


