Posted to user@hive.apache.org by Mikael Öhman <mi...@yahoo.se> on 2013/09/16 10:23:59 UTC

Duplicate rows when using group by in subquery

Hello.

This is basically the same question I posted on stackoverflow: http://stackoverflow.com/questions/18812390/hive-subquery-and-group-by/18818115?noredirect=1#18818115

I know the query in that post is a bit noisy, but the following query also demonstrates the error:

select a.symbol from (select symbol, ordertype from cat group by symbol, ordertype) a group by a.symbol;

This query may not make much sense on its own, but because I have 24 distinct symbols I expect a result of 24 rows. Instead I get 48 rows back. A similar query:

select a.Symbol,count(*) from (select c.Symbol,c.catid from cat as c group by c.Symbol,c.catid) a group by a.Symbol;

returns 864 rows, where I still expect 24. If there are alternative ways to write the original query in my SO post, I would much appreciate hearing them. The examples in this mail are only meant to demonstrate the problem with easier-to-understand queries, so I don't need advice on the examples themselves.
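The expected row count can be sanity-checked outside Hive: grouping by symbol over a subquery that groups by (symbol, ordertype) logically reduces to selecting the distinct symbols. A toy Python sketch (hypothetical symbol names, not the real data):

```python
# Toy rows: (symbol, ordertype) pairs standing in for the cat table.
rows = [("AAPL", 1), ("AAPL", 2), ("MSFT", 1), ("MSFT", 1), ("ORCL", 2)]

# Inner query: GROUP BY symbol, ordertype -> distinct (symbol, ordertype) pairs.
inner = set(rows)

# Outer query: GROUP BY a.symbol -> one row per distinct symbol.
outer = sorted({symbol for symbol, _ in inner})

print(outer)       # ['AAPL', 'MSFT', 'ORCL']
print(len(outer))  # 3 here; 24 distinct symbols in the real data
```

Any result with more than one row per distinct symbol therefore points at the execution, not at the query's semantics.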

The .csv data is from a toy example. My real setup has 6 nodes, and the table definition is:

create table cat(CATID bigint, CUSTOMERID int, FILLPRICE double, FILLSIZE int, INSTRUMENTTYPE int, ORDERACTION int, ORDERSTATUS int, ORDERTYPE int, ORDID string, PRICE double, RECORDTYPE int, SIZE int, SRCORDID string, SRCREPID int, TIMESTAMP timestamp) PARTITIONED BY (SYMBOL string, REPID int) row format delimited fields terminated by ',' stored as ORC;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=1000;

insert...

Thank you so much for any input.

/Sincerely Mikael

Re: Duplicate rows when using group by in subquery

Posted by Yin Huai <hu...@gmail.com>.
Maybe you were still using a CLI that was pointing to the Hive 0.11 libs.
After you build trunk (https://github.com/apache/hive.git), you need to
use <trunk-dir>/build/dist as your Hive home and
<trunk-dir>/build/dist/bin/hive to launch the Hive CLI. You can find the
Hive 0.13 libs in <trunk-dir>/build/dist/lib.

By the way, trunk seems to have an issue today, so you can try the hive
0.12 branch instead.


On Thu, Sep 19, 2013 at 4:26 AM, Mikael Öhman <mi...@yahoo.se> wrote:


Re: Duplicate rows when using group by in subquery

Posted by Mikael Öhman <mi...@yahoo.se>.
Hello again.

I have now checked out the latest code from trunk and built it as per the instructions.

However, this query:

select a.Symbol, count(*) 
from (select Symbol, catid from cat group by Symbol, catid) a 
group by a.Symbol;

still returns an incorrect number of rows for the table:

create table cat(CATID bigint, CUSTOMERID int, FILLPRICE double, FILLSIZE int, INSTRUMENTTYPE int, ORDERACTION int, ORDERSTATUS int, ORDERTYPE int, ORDID string, PRICE double, RECORDTYPE int, SIZE int, SRCORDID string, SRCREPID int, TIMESTAMP timestamp) PARTITIONED BY (SYMBOL string, REPID int) row format delimited fields terminated by ',' stored as ORC;


Here is the result of EXPLAIN:


hive> EXPLAIN select a.Symbol, count(*) 
    > from (select Symbol, catid from cat group by Symbol, catid) a 
    > group by a.Symbol;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME cat))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL Symbol)) (TOK_SELEXPR (TOK_TABLE_OR_COL catid))) (TOK_GROUPBY (TOK_TABLE_OR_COL Symbol) (TOK_TABLE_OR_COL catid)))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) Symbol)) (TOK_SELEXPR (TOK_FUNCTIONSTAR count))) (TOK_GROUPBY (. (TOK_TABLE_OR_COL a) Symbol))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a:cat 
          TableScan
            alias: cat
            Select Operator
              expressions:
                    expr: symbol
                    type: string
                    expr: catid
                    type: bigint
              outputColumnNames: symbol, catid
              Group By Operator
                bucketGroup: false
                keys:
                      expr: symbol
                      type: string
                      expr: catid
                      type: bigint
                mode: hash
                outputColumnNames: _col0, _col1
                Reduce Output Operator
                  key expressions:
                        expr: _col0
                        type: string
                        expr: _col1
                        type: bigint
                  sort order: ++
                  Map-reduce partition columns:
                        expr: _col0
                        type: string
                        expr: _col1
                        type: bigint
                  tag: -1
      Reduce Operator Tree:
        Group By Operator
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
                expr: KEY._col1
                type: bigint
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col0
                  type: string
            outputColumnNames: _col0
            Group By Operator
              aggregations:
                    expr: count()
              bucketGroup: false
              keys:
                    expr: _col0
                    type: string
              mode: complete
              outputColumnNames: _col0, _col1
              Select Operator
                expressions:
                      expr: _col0
                      type: string
                      expr: _col1
                      type: bigint
                outputColumnNames: _col0, _col1
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1


Using set hive.optimize.reducededuplication=false;
I get 2 mapreduce jobs and the correct number of rows (24).

Can I somehow verify, maybe by looking at the source code, that I indeed have the correct version? Or is there a command I can execute from the Hive CLI that shows the version? I just built from source this morning, so it seems strange that the bug would still persist :(.



________________________________
 From: Yin Huai <hu...@gmail.com>
To: user@hive.apache.org; Mikael Öhman <mi...@yahoo.se>
Sent: Tuesday, 17 September 2013 15:30
Subject: Re: Duplicate rows when using group by in subquery
 



Re: Duplicate rows when using group by in subquery

Posted by Yin Huai <hu...@gmail.com>.
Hello Mikael,

ReduceSinkDeduplication automatically kicked in because it is enabled by
default. The original plan tries to shuffle your data twice. Then,
ReduceSinkDeduplication finds that the original plan can be optimized to
shuffle your data once. But, when picking the partitioning columns, this
optimizer picked the wrong columns because of the bug.

Also, you can try your query with and without ReduceSinkDeduplication (use
"set hive.optimize.reducededuplication=false;" to turn the optimization
off) and compare the performance. If the cardinality of "a.Symbol" limits
the degree of parallelism, two MapReduce jobs may be faster.
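As a rough illustration of why the unmerged two-shuffle plan is safe, here is a toy Python model of the two MapReduce jobs (the bucketing functions are simplified stand-ins for Hive's hash partitioner, not its actual implementation):

```python
# Toy rows: (symbol, catid) pairs standing in for the cat table.
rows = [("A", 1), ("A", 1), ("A", 2), ("B", 1)]
NUM_REDUCERS = 2

def shuffle(records, bucket_fn):
    """Partition records across the reducers."""
    parts = [[] for _ in range(NUM_REDUCERS)]
    for r in records:
        parts[bucket_fn(r) % NUM_REDUCERS].append(r)
    return parts

# Job 1 partitions by the full (symbol, catid) key (modeled here by
# bucketing on catid, a function of that key): identical pairs always land
# on the same reducer, so deduplicating per reducer yields the globally
# distinct (symbol, catid) pairs -- the inner GROUP BY.
distinct_pairs = []
for part in shuffle(rows, bucket_fn=lambda r: r[1]):
    distinct_pairs.extend(set(part))

# Job 2 re-partitions by symbol only, so each symbol is counted on exactly
# one reducer -- the outer GROUP BY with count(*).
counts = {}
for part in shuffle(distinct_pairs, bucket_fn=lambda r: ord(r[0])):
    for sym, _ in part:
        counts[sym] = counts.get(sym, 0) + 1

print(counts == {"A": 2, "B": 1})  # True: one output row per symbol
```

Merging the two jobs is only safe if the single shuffle partitions by the outer key (symbol) alone, which is exactly what the buggy optimizer got wrong.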

Thanks,

Yin


On Tue, Sep 17, 2013 at 2:24 AM, Mikael Öhman <mi...@yahoo.se> wrote:


Re: Duplicate rows when using group by in subquery

Posted by Mikael Öhman <mi...@yahoo.se>.
Thank you for the information. Just to be clear: it is not that I have manually restricted the query to run as a single MapReduce job; rather, Hive incorrectly assumes one job is enough?
 
I will get back with results from your suggestions ASAP; unfortunately I don't have the machines available until Thursday.
 
/ Sincerely Mikael
 

________________________________
 From: Yin Huai <hu...@gmail.com>
To: user@hive.apache.org; Mikael Öhman <mi...@yahoo.se>
Sent: Monday, 16 September 2013 19:52
Subject: Re: Duplicate rows when using group by in subquery
  



Re: Duplicate rows when using group by in subquery

Posted by Yin Huai <hu...@gmail.com>.
Hello Mikael,

It seems your case is related to the bug reported in
https://issues.apache.org/jira/browse/HIVE-5149. Basically, when Hive uses
a single MapReduce job to evaluate your query, "c.Symbol" and "c.catid" are
both used to partition the data, and thus rows with the same value of
"c.Symbol" are not correctly grouped. In your case, only "c.Symbol" should
be used if we want a single MapReduce job to evaluate this query. Can you
check the query plan (the output of "EXPLAIN") and see whether both
"c.Symbol" and "c.catid" appear in the partitioning columns? Or you can
attach your query plan.
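The mechanism can be reproduced with a toy Python model of the single-job plan (the bucketing functions below are illustrative stand-ins for Hive's partitioner, not its real code):

```python
# Toy model of the single-MapReduce-job plan. Rows are (symbol, catid)
# pairs; two reducers stand in for the cluster.
rows = [("A", 1), ("A", 2), ("B", 1)]
NUM_REDUCERS = 2

def run_single_job(bucket_fn):
    """One shuffle, then an independent GROUP BY symbol on each reducer."""
    reducers = [[] for _ in range(NUM_REDUCERS)]
    for row in rows:
        reducers[bucket_fn(row) % NUM_REDUCERS].append(row)
    output = []
    for part in reducers:
        output.extend(sorted({sym for sym, _ in part}))
    return output

# The bug: partitioning by (symbol, catid) -- modeled here by bucketing on
# catid -- sends the two "A" rows to different reducers, so each reducer
# emits its own "A" group.
buggy = run_single_job(lambda row: row[1])

# The fix: partitioning by symbol only keeps each symbol on one reducer.
correct = run_single_job(lambda row: ord(row[0]))

print(buggy)            # ['A', 'A', 'B'] -- a duplicate "A" row
print(sorted(correct))  # ['A', 'B'] -- one row per symbol
```

This is exactly the duplicate-rows symptom in the original report: the output has one extra row per symbol for every reducer that happened to receive some of that symbol's rows.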

This bug has been fixed in trunk. Can you test your query on trunk and see
if the result is correct? If you are using Hive 0.11, you can also apply
the 0.11 patch (
https://issues.apache.org/jira/secure/attachment/12601446/HIVE-5149-branch-0.11.patch
).

Thanks,

Yin


On Mon, Sep 16, 2013 at 4:23 AM, Mikael Öhman <mi...@yahoo.se> wrote:
