You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Peter Chu <pe...@outlook.com> on 2013/05/06 20:07:22 UTC

Hive Group By Limitations

In Hive, I cannot perform a SELECT GROUP BY on fields not in the GROUP BY clause.
Example: SELECT st.a, st.b, st.c, st.d, FROM some_table st GROUP BY st.a;  -- This does not work.
To make it work, I would need to add the other fields in the group by clause.
Not quite sure but I think each group by will give another M/R job.
Wondering if there is any other way / better way to do group by.
Peter

Re: Hive Group By Limitations

Posted by John Meagher <jo...@gmail.com>.

"Not quite sure but I think each group by will give another M/R job."

It will be done in a single M/R job no matter how many fields are in
the GROUP BY clause.

On Mon, May 6, 2013 at 2:07 PM, Peter Chu <pe...@outlook.com> wrote:
> In Hive, I cannot perform a SELECT GROUP BY on fields not in the GROUP BY
> clause.
>
> Example: SELECT st.a, st.b, st.c, st.d, FROM some_table st GROUP BY st.a;
> -- This does not work.
>
> To make it work, I would need to add the other fields in the group by
> clause.
>
> Not quite sure but I think each group by will give another M/R job.
>
> Wondering if there is any other way / better way to do group by.
>
> Peter

Re: Hive Group By Limitations

Posted by Nitin Pawar <ni...@gmail.com>.

best way to do all this would be run a distinct and group by along side a
join (its just a  guess but a more detailed approach other guys will
suggest )


On Mon, May 6, 2013 at 11:57 PM, Peter Chu <pe...@outlook.com> wrote:

> Thanks Nitin and Michael,
>
> The reason I asked is because I cannot help but wonder if it takes extra
> time with all those group by columns.
>
> Say for example, I have a employees table with 10 columns pertaining to
> employees but there could be duplicates, I need to de dup it by performing
> a group by employee id, and hire date, but were I want to select the other
> fields as well, I would have to add those fields in the group by clause, or
> join it back in another table
>
> When the table is big, I cannot help but wonder if it hurts the
> performance.
>
> Peter
>
> ------------------------------
> Date: Mon, 6 May 2013 23:42:23 +0530
> Subject: Re: Hive Group By Limitations
> From: nitinpawar432@gmail.com
> To: user@hive.apache.org
>
>
> hi Peter,
>
> In hive if you are running a group by, then all the select columns have to
> be in the group by clause. This limitation is for the column definition
> only and not for the column operations like count etc
>
> All the columns for group by do go to a single map reduce job and it does
> not launch multiple mapreduce jobs for each group by.
>
> I am not sure what do you mean by better way?
>
>
>
>
> On Mon, May 6, 2013 at 11:37 PM, Peter Chu <pe...@outlook.com> wrote:
>
> In Hive, I cannot perform a SELECT GROUP BY on fields not in the GROUP BY
> clause.
>
> Example: SELECT st.a, st.b, st.c, st.d, FROM some_table st GROUP BY st.a;
>  -- This does not work.
>
> To make it work, I would need to add the other fields in the group by
> clause.
>
> Not quite sure but I think each group by will give another M/R job.
>
> Wondering if there is any other way / better way to do group by.
>
> Peter
>
>
>
>
> --
> Nitin Pawar
>



-- 
Nitin Pawar

RE: Hive Group By Limitations

Posted by Peter Chu <pe...@outlook.com>.

Thanks Nitin and Michael,
The reason I asked is because I cannot help but wonder if it takes extra time with all those group by columns.  
Say for example, I have a employees table with 10 columns pertaining to employees but there could be duplicates, I need to de dup it by performing a group by employee id, and hire date, but were I want to select the other fields as well, I would have to add those fields in the group by clause, or join it back in another table
When the table is big, I cannot help but wonder if it hurts the performance.
Peter

Date: Mon, 6 May 2013 23:42:23 +0530
Subject: Re: Hive Group By Limitations
From: nitinpawar432@gmail.com
To: user@hive.apache.org

hi Peter,
In hive if you are running a group by, then all the select columns have to be in the group by clause. This limitation is for the column definition only and not for the column operations like count etc

All the columns for group by do go to a single map reduce job and it does not launch multiple mapreduce jobs for each group by. 
I am not sure what do you mean by better way? 

On Mon, May 6, 2013 at 11:37 PM, Peter Chu <pe...@outlook.com> wrote:

In Hive, I cannot perform a SELECT GROUP BY on fields not in the GROUP BY clause.
Example: SELECT st.a, st.b, st.c, st.d, FROM some_table st GROUP BY st.a;  -- This does not work.

To make it work, I would need to add the other fields in the group by clause.
Not quite sure but I think each group by will give another M/R job.
Wondering if there is any other way / better way to do group by.

Peter 		 	   		  

-- 
Nitin Pawar

Re: Hive Group By Limitations

Posted by Nitin Pawar <ni...@gmail.com>.

hi Peter,

In hive if you are running a group by, then all the select columns have to
be in the group by clause. This limitation is for the column definition
only and not for the column operations like count etc

All the columns for group by do go to a single map reduce job and it does
not launch multiple mapreduce jobs for each group by.

I am not sure what do you mean by better way?

On Mon, May 6, 2013 at 11:37 PM, Peter Chu <pe...@outlook.com> wrote:

> In Hive, I cannot perform a SELECT GROUP BY on fields not in the GROUP BY
> clause.
>
> Example: SELECT st.a, st.b, st.c, st.d, FROM some_table st GROUP BY st.a;
>  -- This does not work.
>
> To make it work, I would need to add the other fields in the group by
> clause.
>
> Not quite sure but I think each group by will give another M/R job.
>
> Wondering if there is any other way / better way to do group by.
>
> Peter
>

-- 
Nitin Pawar