You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Carol McDonald <cm...@maprtech.com> on 2015/07/20 18:25:08 UTC

dataframes sql order by not total ordering

the following query on the Movielens dataset , is sorting by the count of
ratings for a movie.  It looks like the results  are ordered  by partition
?
scala> val results =sqlContext.sql("select movies.title, movierates.maxr,
movierates.minr, movierates.cntu from(SELECT ratings.product,
max(ratings.rating) as maxr, min(ratings.rating) as minr,count(distinct
user) as cntu FROM ratings group by ratings.product order by cntu desc)
movierates join movies on movierates.product=movies.movieId ")

scala> results.take(30).foreach(println)
[Right Stuff, The (1983),5.0,1.0,750]
[Lost in Space (1998),5.0,1.0,667]
[Dumb & Dumber (1994),5.0,1.0,660]
[Patch Adams (1998),5.0,1.0,474]
[Carlito's Way (1993),5.0,1.0,369]
[Rounders (1998),5.0,1.0,345]
[Bedknobs and Broomsticks (1971),5.0,1.0,319]
[Beverly Hills Ninja (1997),5.0,1.0,232]
[Saving Grace (2000),5.0,1.0,186]
[Dangerous Minds (1995),5.0,1.0,141]
[Death Wish II (1982),5.0,1.0,85]
[All Dogs Go to Heaven 2 (1996),5.0,1.0,75]
[Repossessed (1990),4.0,1.0,53]
[Assignment, The (1997),5.0,1.0,49]
[$1,000,000 Duck (1971),5.0,1.0,37]
[Stonewall (1995),5.0,1.0,20]
[Dog of Flanders, A (1999),5.0,1.0,8]
[Frogs for Snakes (1998),3.0,1.0,5]
[It's in the Water (1998),3.0,2.0,3]
[Twelve Monkeys (1995),5.0,1.0,1511]
[Ransom (1996),5.0,1.0,564]
[Alice in Wonderland (1951),5.0,1.0,525]
[City Slickers II: The Legend of Curly's Gold (1994),5.0,1.0,392]
[Eat Drink Man Woman (1994),5.0,1.0,346]
[Cube (1997),5.0,1.0,233]
[Omega Man, The (1971),5.0,1.0,224]
[Stepmom (1998),5.0,1.0,146]
[Metro (1997),5.0,1.0,100]
[Death Wish 3 (1985),5.0,1.0,72]
[Stalker (1979),5.0,1.0,52]

Re: dataframes sql order by not total ordering

Posted by Carol McDonald <cm...@maprtech.com>.
Thanks, that works a lot better ;)

scala> val results =sqlContext.sql("select movies.title, movierates.maxr,
movierates.minr, movierates.cntu from(SELECT ratings.product,
max(ratings.rating) as maxr, min(ratings.rating) as minr,count(distinct
user) as cntu FROM ratings group by ratings.product ) movierates join
movies on movierates.product=movies.movieId order by movierates.cntu desc ")
results: org.apache.spark.sql.DataFrame = [title: string, maxr: double,
minr: double, cntu: bigint]

scala> results.show
title                maxr minr cntu
American Beauty (... 5.0  1.0  3428
Star Wars: Episod... 5.0  1.0  2991
Star Wars: Episod... 5.0  1.0  2990
Star Wars: Episod... 5.0  1.0  2883
Jurassic Park (1993) 5.0  1.0  2672
Saving Private Ry... 5.0  1.0  2653

On Mon, Jul 20, 2015 at 6:09 PM, Michael Armbrust <mi...@databricks.com>
wrote:

> An ORDER BY needs to be on the outermost query otherwise subsequent
> operations (such as the join) could reorder the tuples.
>
> On Mon, Jul 20, 2015 at 9:25 AM, Carol McDonald <cm...@maprtech.com>
> wrote:
>
>> the following query on the Movielens dataset , is sorting by the count of
>> ratings for a movie.  It looks like the results  are ordered  by partition
>> ?
>> scala> val results =sqlContext.sql("select movies.title, movierates.maxr,
>> movierates.minr, movierates.cntu from(SELECT ratings.product,
>> max(ratings.rating) as maxr, min(ratings.rating) as minr,count(distinct
>> user) as cntu FROM ratings group by ratings.product order by cntu desc)
>> movierates join movies on movierates.product=movies.movieId ")
>>
>> scala> results.take(30).foreach(println)
>> [Right Stuff, The (1983),5.0,1.0,750]
>> [Lost in Space (1998),5.0,1.0,667]
>> [Dumb & Dumber (1994),5.0,1.0,660]
>> [Patch Adams (1998),5.0,1.0,474]
>> [Carlito's Way (1993),5.0,1.0,369]
>> [Rounders (1998),5.0,1.0,345]
>> [Bedknobs and Broomsticks (1971),5.0,1.0,319]
>> [Beverly Hills Ninja (1997),5.0,1.0,232]
>> [Saving Grace (2000),5.0,1.0,186]
>> [Dangerous Minds (1995),5.0,1.0,141]
>> [Death Wish II (1982),5.0,1.0,85]
>> [All Dogs Go to Heaven 2 (1996),5.0,1.0,75]
>> [Repossessed (1990),4.0,1.0,53]
>> [Assignment, The (1997),5.0,1.0,49]
>> [$1,000,000 Duck (1971),5.0,1.0,37]
>> [Stonewall (1995),5.0,1.0,20]
>> [Dog of Flanders, A (1999),5.0,1.0,8]
>> [Frogs for Snakes (1998),3.0,1.0,5]
>> [It's in the Water (1998),3.0,2.0,3]
>> [Twelve Monkeys (1995),5.0,1.0,1511]
>> [Ransom (1996),5.0,1.0,564]
>> [Alice in Wonderland (1951),5.0,1.0,525]
>> [City Slickers II: The Legend of Curly's Gold (1994),5.0,1.0,392]
>> [Eat Drink Man Woman (1994),5.0,1.0,346]
>> [Cube (1997),5.0,1.0,233]
>> [Omega Man, The (1971),5.0,1.0,224]
>> [Stepmom (1998),5.0,1.0,146]
>> [Metro (1997),5.0,1.0,100]
>> [Death Wish 3 (1985),5.0,1.0,72]
>> [Stalker (1979),5.0,1.0,52]
>>
>
>

Re: dataframes sql order by not total ordering

Posted by Michael Armbrust <mi...@databricks.com>.
An ORDER BY needs to be on the outermost query otherwise subsequent
operations (such as the join) could reorder the tuples.

On Mon, Jul 20, 2015 at 9:25 AM, Carol McDonald <cm...@maprtech.com>
wrote:

> the following query on the Movielens dataset , is sorting by the count of
> ratings for a movie.  It looks like the results  are ordered  by partition
> ?
> scala> val results =sqlContext.sql("select movies.title, movierates.maxr,
> movierates.minr, movierates.cntu from(SELECT ratings.product,
> max(ratings.rating) as maxr, min(ratings.rating) as minr,count(distinct
> user) as cntu FROM ratings group by ratings.product order by cntu desc)
> movierates join movies on movierates.product=movies.movieId ")
>
> scala> results.take(30).foreach(println)
> [Right Stuff, The (1983),5.0,1.0,750]
> [Lost in Space (1998),5.0,1.0,667]
> [Dumb & Dumber (1994),5.0,1.0,660]
> [Patch Adams (1998),5.0,1.0,474]
> [Carlito's Way (1993),5.0,1.0,369]
> [Rounders (1998),5.0,1.0,345]
> [Bedknobs and Broomsticks (1971),5.0,1.0,319]
> [Beverly Hills Ninja (1997),5.0,1.0,232]
> [Saving Grace (2000),5.0,1.0,186]
> [Dangerous Minds (1995),5.0,1.0,141]
> [Death Wish II (1982),5.0,1.0,85]
> [All Dogs Go to Heaven 2 (1996),5.0,1.0,75]
> [Repossessed (1990),4.0,1.0,53]
> [Assignment, The (1997),5.0,1.0,49]
> [$1,000,000 Duck (1971),5.0,1.0,37]
> [Stonewall (1995),5.0,1.0,20]
> [Dog of Flanders, A (1999),5.0,1.0,8]
> [Frogs for Snakes (1998),3.0,1.0,5]
> [It's in the Water (1998),3.0,2.0,3]
> [Twelve Monkeys (1995),5.0,1.0,1511]
> [Ransom (1996),5.0,1.0,564]
> [Alice in Wonderland (1951),5.0,1.0,525]
> [City Slickers II: The Legend of Curly's Gold (1994),5.0,1.0,392]
> [Eat Drink Man Woman (1994),5.0,1.0,346]
> [Cube (1997),5.0,1.0,233]
> [Omega Man, The (1971),5.0,1.0,224]
> [Stepmom (1998),5.0,1.0,146]
> [Metro (1997),5.0,1.0,100]
> [Death Wish 3 (1985),5.0,1.0,72]
> [Stalker (1979),5.0,1.0,52]
>