Posted to user@hive.apache.org by Guy Doulberg <Gu...@conduit.com> on 2011/01/25 17:25:36 UTC
Distinct in hive
Hey,
We wrote a query in Hive that calculates the number of distinct values in a group by.
On a small portion of the data it worked well; however, when we ran the query over a large portion of the data, it failed with OutOfMemory errors in some of the reducers.
How does the distinct operator work in Hive? Does it use some sort of data structure whose size is proportional to the number of distinct values?
Many thanks
RE: Distinct in hive
Posted by Guy Doulberg <Gu...@conduit.com>.
Thanks
That was it
From: Namit Jain [mailto:njain@fb.com]
Sent: Tuesday, January 25, 2011 7:04 PM
To: user@hive.apache.org
Subject: Re: Distinct in hive
Is there skew in the data?
You may want to set the parameter hive.groupby.skewindata to true.
Thanks,
-namit
Re: Distinct in hive
Posted by Namit Jain <nj...@fb.com>.
Is there skew in the data?
You may want to set the parameter hive.groupby.skewindata to true.
Thanks,
-namit
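For reference, here is a minimal sketch of applying the suggested setting in a Hive session before running a distinct-count query. The property name is the one given in the reply above; the table and column names (page_views, country, user_id) are hypothetical and do not come from the original thread.

```sql
-- Turn on the skew handling suggested in the reply (affects this session only).
SET hive.groupby.skewindata=true;

-- Hypothetical query: page_views, country and user_id are illustrative names,
-- standing in for whatever table and columns the real query used.
SELECT country, COUNT(DISTINCT user_id) AS distinct_users
FROM page_views
GROUP BY country;
```

Issuing SET this way changes the value only for the current session, so it can be tried on the failing query without touching the cluster-wide configuration.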