You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by Lior Chaga <li...@taboola.com> on 2015/07/16 14:10:58 UTC

Use rank with distribute by in HiveContext

Does spark HiveContext support the rank() ... distribute by syntax (as in
the following article-
http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/doing_rank_with_hive
)?

If not, how can it be achieved?

Thanks,
Lior

RE: Use rank with distribute by in HiveContext

Posted by java8964 <ja...@hotmail.com>.

Yes. The HIVE UDF and distribute by both supported by Spark SQL.
If you are using Spark 1.4, you can try Hive analytics windows function (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics),most of which are already supported in Spark 1.4, so you don't need the customize UDF of rank.
Yong
Date: Thu, 16 Jul 2015 15:10:58 +0300
Subject: Use rank with distribute by in HiveContext
From: lior.c@taboola.com
To: user@spark.apache.org

Does spark HiveContext support the rank() ... distribute by syntax (as in the following article- http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/doing_rank_with_hive )?
If not, how can it be achieved?
Thanks,Lior

Re: Use rank with distribute by in HiveContext

Posted by Todd Nist <ts...@gmail.com>.

Did you take a look at the excellent write up by Yin Huai and Michael
Armbrust?  It appears that rank is supported in the 1.4.x release.

https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

Snippet from above article for your convenience:

To answer the first question “*What are the best-selling and the second
best-selling products in every category?*”, we need to rank products in a
category based on their revenue, and to pick the best selling and the
second best-selling products based the ranking. Below is the SQL query used
to answer this question by using window function dense_rank (we will
explain the syntax of using window functions in next section).

SELECT
  product,
  category,
  revenueFROM (
  SELECT
    product,
    category,
    revenue,
    dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank
  FROM productRevenue) tmpWHERE
  rank <= 2

The result of this query is shown below. Without using window functions, it
is very hard to express the query in SQL, and even if a SQL query can be
expressed, it is hard for the underlying engine to efficiently evaluate the
query.

[image: 1-2]

SQLDataFrame APIRanking functionsrankrankdense_rankdenseRankpercent_rank
percentRankntilentilerow_numberrowNumber

 HTH.

-Todd

On Thu, Jul 16, 2015 at 8:10 AM, Lior Chaga <li...@taboola.com> wrote:

> Does spark HiveContext support the rank() ... distribute by syntax (as in
> the following article-
> http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/doing_rank_with_hive
> )?
>
> If not, how can it be achieved?
>
> Thanks,
> Lior
>