Posted to user@hive.apache.org by Eric Chu <ec...@rocketfuel.com> on 2014/07/11 10:08:18 UTC

difference between partition by and distribute by in rank()

Does anyone know what

*rank() over(distribute by p_mfgr sort by p_name) *

does exactly and how it's different from

*rank() over(partition by p_mfgr order by p_name)*?

Thanks,

Eric

Re: difference between partition by and distribute by in rank()

Posted by Nitin Pawar <ni...@gmail.com>.
As a general principle,
distribute by ensures each of N reducers gets non-overlapping ranges of X,
but doesn't sort the output of each reducer. You end up with N unsorted
files with non-overlapping ranges. So this is more of a horizontal
distribution of data.
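
To make the description above concrete, here is a minimal Python sketch of what DISTRIBUTE BY does to rows. It is only a model, not Hive code: the function name distribute_by is hypothetical, and zlib.crc32 stands in for whatever hash Hive actually applies to the key. The point it shows is that rows with the same key always land on the same reducer, while the order of rows inside each reducer is left as it arrived.

```python
import zlib
from collections import defaultdict


def distribute_by(rows, key, num_reducers):
    """Route each row to a reducer by hashing its key column.

    Hypothetical helper for illustration; crc32 is an assumed stand-in
    for Hive's own hash function.
    """
    reducers = defaultdict(list)
    for row in rows:
        bucket = zlib.crc32(str(row[key]).encode()) % num_reducers
        # Rows are appended in arrival order: nothing is sorted here.
        reducers[bucket].append(row)
    return dict(reducers)


rows = [
    {"p_mfgr": "Manufacturer#1", "p_name": "plum"},
    {"p_mfgr": "Manufacturer#2", "p_name": "apple"},
    {"p_mfgr": "Manufacturer#1", "p_name": "almond"},
    {"p_mfgr": "Manufacturer#3", "p_name": "cherry"},
]
buckets = distribute_by(rows, "p_mfgr", num_reducers=2)
```

Both Manufacturer#1 rows end up in the same bucket, still in the order "plum", "almond". Getting them into name order needs a separate sort step (SORT BY in Hive).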

In my view,
partition by is based more on values, so it's a vertical distribution of data.

I may be wrong in my understanding of this.





-- 
Nitin Pawar

Re: difference between partition by and distribute by in rank()

Posted by Eric Chu <ec...@rocketfuel.com>.
Thanks for the responses. I understand DISTRIBUTE BY and SORT BY in the
normal case (as described in the Hive doc); I just don't understand their
behavior in the OVER clause with RANK, which apparently you can do. See
ql/src/test/queries/clientpositive/windowing.q for example.

Yes, I saw Edward's blog. His solution is a UDF, while Hive's rank() is a
UDAF. Also, if you use his function, let's say you do

DISTRIBUTE BY user, SORT BY score DESC

and then call RANK(user) on that.

The UDF would just give a different rank to each row within the same user
group; it can't give the same rank to different rows in the same user
group that have the same score.
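
As a rough illustration of that limitation, here is a small Python model of a per-row counter that restarts whenever the user changes. It is not Edward's actual UDF; the name naive_rank and the sample data are made up for the example, and the rows are assumed to arrive already distributed by user and sorted by score descending.

```python
def naive_rank(rows):
    """Hypothetical per-row counter: restarts at 1 when the user changes.

    Assumes rows arrive grouped by user and sorted by score descending.
    """
    last_user, counter = None, 0
    for user, score in rows:
        counter = counter + 1 if user == last_user else 1
        last_user = user
        yield user, score, counter


sample_rows = [("alice", 90), ("alice", 90), ("alice", 70), ("bob", 50)]
naive_result = list(naive_rank(sample_rows))
# Counter values: 1, 2, 3, 1
```

The two ("alice", 90) rows get 1 and 2, whereas rank() semantics would give both of them 1 and give the 70 row rank 3.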

Hive's rank() OVER PARTITION BY seems to support this in the iterate()
method. Also, the function is applied to a single partition (in this case,
per user group), as opposed to a single reducer that may see different
partitions, and the prev/current row comparison is done on the PARTITION BY
columns.

The actual problem I'm hitting is that when I use Hive's rank(), I run into
an OOM issue when it adds a rank to an ArrayList in the RankBuffer class in
GenericUDAFRank.java. The same problem occurs with both RANK OVER /
DISTRIBUTE BY / SORT BY and RANK OVER / PARTITION BY / ORDER BY. So I want
to understand if there's a mitigation other than increasing the heap.

If not, I'll have to go back to the UDF approach, which just outputs a rank
for each row and so doesn't have the OOM issue. But since the UDF goes
through all the rows on a reducer, I'd need to distinguish between the
DISTRIBUTE BY columns and the SORT BY columns in the UDF, so that it can
give the same rank to rows with the same SORT BY column values.
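
The idea in the last paragraph could be sketched roughly as follows, in Python rather than as a real Hive UDF. The name streaming_rank and the (group, value) row shape are assumptions for illustration: "group" plays the role of the DISTRIBUTE BY columns and "value" the SORT BY columns. The sketch keeps only a few variables of state per reducer, so it does not buffer one rank per row.

```python
def streaming_rank(rows):
    """Yield (group, value, rank) with rank()-style ties and gaps.

    Hypothetical sketch. Assumes rows arrive grouped by `group`
    (the DISTRIBUTE BY columns) and sorted by `value` within each
    group (the SORT BY columns).
    """
    last_group = object()  # sentinel that never equals a real group
    last_value = None
    position = 0  # 1-based row position within the current group
    rank = 0
    for group, value in rows:
        if group != last_group:
            # New group: restart numbering.
            position, rank = 1, 1
        else:
            position += 1
            if value != last_value:
                # Sort value changed: rank jumps to the row position.
                rank = position
        last_group, last_value = group, value
        yield group, value, rank


sample = [("alice", 90), ("alice", 90), ("alice", 70), ("bob", 50), ("bob", 50)]
ranks = [r for _, _, r in streaming_rank(sample)]
# ranks == [1, 1, 3, 1, 1]
```

Tied scores share a rank and the next distinct score skips ahead, which is the behaviour the plain per-row counter cannot provide.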






On Fri, Jul 11, 2014 at 2:31 AM, Joshi, Rekha <Re...@intuit.com>
wrote:

>  Hi,
>
>  The ORDER BY versus SORT BY nuances, which concern reducers and the
> total order of the final output, are well known.
>
>  One could *simulate* rank() over() functionality by using DISTRIBUTE BY /
> SORT BY on the dataset (or CLUSTER BY if the key is the same), as in
> Edward's blog
> <http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/doing_rank_with_hive>
> .
>
>  From Hive 0.11, you can directly call rank() over (partition .. order ..):
> <https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics#LanguageManualWindowingAndAnalytics-WindowingandAnalyticsFunctions>
>
>  AFAIK, in Hive the rank() over() syntax uses (partition .. order ..) only.
>
>  Thanks
> Rekha
>

Re: difference between partition by and distribute by in rank()

Posted by "Joshi, Rekha" <Re...@intuit.com>.
Hi,

The ORDER BY versus SORT BY nuances, which concern reducers and the total order of the final output, are well known.

One could simulate rank() over() functionality by using DISTRIBUTE BY / SORT BY on the dataset (or CLUSTER BY if the key is the same), as in Edward's blog <http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/doing_rank_with_hive>.

From Hive 0.11, you can directly call rank() over (partition .. order ..): <https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics#LanguageManualWindowingAndAnalytics-WindowingandAnalyticsFunctions>

AFAIK, in Hive the rank() over() syntax uses (partition .. order ..) only.

Thanks
Rekha
