You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by Gourav Sengupta <go...@gmail.com> on 2013/06/12 14:51:14 UTC

Fwd: GROUP BY Issue

Hi,

I had initially forwarded this request to the user group but am yet to
receive any response.

I will be grateful if someone can help me out in resolving the issue or
pointing out any mistakes that I may be doing.

It took me around 5 to 6 hours to generate the test data of around 20 GB
(or more) and there must be a better alternative.

Regards,
Gourav

---------- Forwarded message ----------
From: Gourav Sengupta <go...@gmail.com>
Date: Mon, Jun 10, 2013 at 4:10 PM
Subject: GROUP BY Issue
To: user@hive.apache.org


Hi,

On running the following query I am getting multiple records with same
value of F1

SELECT F1, COUNT(*)
FROM
(
SELECT F1, F2, COUNT(*)
FROM TABLE1
GROUP BY F1, F2
) a
GROUP BY F1;

As per what I understand there are multiple number of records based on
number of reducers.

Replicating the test scenario:
STEP1: get the dataset as available in
http://snap.stanford.edu/data/amazon0302.html

STEP2: Open the file and delete the heading

STEP3: hadoop fs -mkdir /test

STEP4: hadoop fs -put amazon0302.txt /test

STEP5: create external table test (f1 int, f2 int) row format delimited
fields terminated by '\t' lines terminated by '\n' stored as textfile
location '/test';

STEP6: create table test1 location '/test1' as select left_table.* from
(select * from test where f1<10000) left_table join (select * from test
where f1 < 10000) right_table;

STEP7: hadoop fs -mkdir /test2

STEP8: create table test2 location '/test2' as select f1, count(*) from
(select f1, f2, count(*) from test1 group by f1, f2) a group by f1;

STEP9: select * from test2 where f1 = 9887;

ENVIRONMENT:
HADOOP 2.0.4
HIVE 0.11

Please do let me know whether I am doing anything wrong.


Thanks and Regards,
Gourav Sengupta

Re: GROUP BY Issue

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,

There is a bug in HIVE 0.11 and my earlier email clearly explains how the
issue can be replicated.

I am grateful to have received a pertinent response here
http://hortonworks.com/community/forums/topic/hive-group-by-issue/#post-27613

Setting the number of reducers to 1 does resolve the issue, but does not
sound like an optimal solution.

As far as I think this is an issue with the Query optimizer and will be
obliged to know whether this issue is being resolved.


THE ISSUE (once again):

The following query should give me one row for multiple rows of f1 = 9887
in TABLE1. But it is giving more than one row based on the number of
reducers which are running.

SELECT F1, COUNT(*)
FROM
(
SELECT F1, F2, COUNT(*)
FROM TABLE1
GROUP BY F1, F2
) a
GROUP BY F1;

Steps for replicating this scenario is mentioned in my earlier email, and
it is meant to create duplicate rows by large numbers in order to create
multiple number of reducers. Around 30 GB of data was generated by the
steps mentioned above by running the data generation query for over 6 hours.



Thanks and Regards,
Gourav







On Fri, Jun 14, 2013 at 3:27 AM, Navis류승우 <na...@nexr.com> wrote:

> Your table has five "F1 = 9887" rows and joining will make 25 rows
> with the same F1 value.
>
> I cannot imagine what you're intended to do.
>
> 2013/6/12 Gourav Sengupta <go...@gmail.com>:
> > Hi,
> >
> > I had initially forwarded this request to the user group but am yet to
> > receive any response.
> >
> > I will be grateful if someone can help me out in resolving the issue or
> > pointing out any mistakes that I may be doing.
> >
> > It took me around 5 to 6 hours to generate the test data of around 20 GB
> > (or more) and there must be a better alternative.
> >
> > Regards,
> > Gourav
> >
> > ---------- Forwarded message ----------
> > From: Gourav Sengupta <go...@gmail.com>
> > Date: Mon, Jun 10, 2013 at 4:10 PM
> > Subject: GROUP BY Issue
> > To: user@hive.apache.org
> >
> >
> > Hi,
> >
> > On running the following query I am getting multiple records with same
> > value of F1
> >
> > SELECT F1, COUNT(*)
> > FROM
> > (
> > SELECT F1, F2, COUNT(*)
> > FROM TABLE1
> > GROUP BY F1, F2
> > ) a
> > GROUP BY F1;
> >
> > As per what I understand there are multiple number of records based on
> > number of reducers.
> >
> > Replicating the test scenario:
> > STEP1: get the dataset as available in
> > http://snap.stanford.edu/data/amazon0302.html
> >
> > STEP2: Open the file and delete the heading
> >
> > STEP3: hadoop fs -mkdir /test
> >
> > STEP4: hadoop fs -put amazon0302.txt /test
> >
> > STEP5: create external table test (f1 int, f2 int) row format delimited
> > fields terminated by '\t' lines terminated by '\n' stored as textfile
> > location '/test';
> >
> > STEP6: create table test1 location '/test1' as select left_table.* from
> > (select * from test where f1<10000) left_table join (select * from test
> > where f1 < 10000) right_table;
> >
> > STEP7: hadoop fs -mkdir /test2
> >
> > STEP8: create table test2 location '/test2' as select f1, count(*) from
> > (select f1, f2, count(*) from test1 group by f1, f2) a group by f1;
> >
> > STEP9: select * from test2 where f1 = 9887;
> >
> > ENVIRONMENT:
> > HADOOP 2.0.4
> > HIVE 0.11
> >
> > Please do let me know whether I am doing anything wrong.
> >
> >
> > Thanks and Regards,
> > Gourav Sengupta
>

Re: GROUP BY Issue

Posted by Navis류승우 <na...@nexr.com>.
Your table has five "F1 = 9887" rows and joining will make 25 rows
with the same F1 value.

I cannot imagine what you're intended to do.

2013/6/12 Gourav Sengupta <go...@gmail.com>:
> Hi,
>
> I had initially forwarded this request to the user group but am yet to
> receive any response.
>
> I will be grateful if someone can help me out in resolving the issue or
> pointing out any mistakes that I may be doing.
>
> It took me around 5 to 6 hours to generate the test data of around 20 GB
> (or more) and there must be a better alternative.
>
> Regards,
> Gourav
>
> ---------- Forwarded message ----------
> From: Gourav Sengupta <go...@gmail.com>
> Date: Mon, Jun 10, 2013 at 4:10 PM
> Subject: GROUP BY Issue
> To: user@hive.apache.org
>
>
> Hi,
>
> On running the following query I am getting multiple records with same
> value of F1
>
> SELECT F1, COUNT(*)
> FROM
> (
> SELECT F1, F2, COUNT(*)
> FROM TABLE1
> GROUP BY F1, F2
> ) a
> GROUP BY F1;
>
> As per what I understand there are multiple number of records based on
> number of reducers.
>
> Replicating the test scenario:
> STEP1: get the dataset as available in
> http://snap.stanford.edu/data/amazon0302.html
>
> STEP2: Open the file and delete the heading
>
> STEP3: hadoop fs -mkdir /test
>
> STEP4: hadoop fs -put amazon0302.txt /test
>
> STEP5: create external table test (f1 int, f2 int) row format delimited
> fields terminated by '\t' lines terminated by '\n' stored as textfile
> location '/test';
>
> STEP6: create table test1 location '/test1' as select left_table.* from
> (select * from test where f1<10000) left_table join (select * from test
> where f1 < 10000) right_table;
>
> STEP7: hadoop fs -mkdir /test2
>
> STEP8: create table test2 location '/test2' as select f1, count(*) from
> (select f1, f2, count(*) from test1 group by f1, f2) a group by f1;
>
> STEP9: select * from test2 where f1 = 9887;
>
> ENVIRONMENT:
> HADOOP 2.0.4
> HIVE 0.11
>
> Please do let me know whether I am doing anything wrong.
>
>
> Thanks and Regards,
> Gourav Sengupta