Posted to dev@spark.apache.org by Kannan Rajah <kr...@maprtech.com> on 2015/02/19 03:33:21 UTC
Spark-SQL 1.2.0 "sort by" results are not consistent with Hive
According to the Hive documentation, "sort by" is supposed to order the results
within each reducer. So if we set a single reducer, the results should be
fully sorted, right? But this is not happening. Any idea why? It looks like
the settings I am using to restrict the number of reducers are not having
any effect.
*Tried the following:*
Set spark.default.parallelism to 1
Set spark.sql.shuffle.partitions to 1
These were set in hive-site.xml and also inside the Spark shell.
*Spark-SQL*
create table if not exists testSortBy (key int, name string, age int);
LOAD DATA LOCAL INPATH '/home/mapr/sample-name-age.txt' OVERWRITE INTO TABLE
testSortBy;
select * from testSortBy;
1 Aditya 28
2 aash 25
3 prashanth 27
4 bharath 26
5 terry 27
6 nanda 26
7 pradeep 27
8 pratyay 26
set spark.default.parallelism=1;
set spark.sql.shuffle.partitions=1;
select name,age from testSortBy sort by age;
aash 25
bharath 26
prashanth 27
Aditya 28
nanda 26
pratyay 26
terry 27
pradeep 27

*HIVE*
select name,age from testSortBy sort by age;
aash 25
bharath 26
nanda 26
pratyay 26
prashanth 27
terry 27
pradeep 27
Aditya 28
--
Kannan
RE: Spark-SQL 1.2.0 "sort by" results are not consistent with Hive
Posted by "Cheng, Hao" <ha...@intel.com>.
In that case, I suggest you use "order by" instead of "sort by" in Spark SQL if a totally sorted result is important to you. Otherwise (reducer count > 1), I don't see any reason Spark SQL should produce the same output as Hive, since the two use completely different partitioner functions.
Some more discussion can be found at:
https://github.com/apache/spark/pull/3496 (Note: the Stack Overflow answer linked in the description is not correct.)
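To make the distinction concrete, here is a minimal SQL sketch against the testSortBy table from the original post (an illustration, not output from this thread):

-- ORDER BY guarantees one totally ordered result, no matter how many
-- partitions the table is read into:
select name, age from testSortBy order by age;

-- SORT BY only orders rows within each partition; with more than one
-- partition, the concatenated output is not globally sorted:
select name, age from testSortBy sort by age;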
RE: Spark-SQL 1.2.0 "sort by" results are not consistent with Hive
Posted by "Cheng, Hao" <ha...@intel.com>.
How many reducers did you set for Hive? With a small data set, Hive will run in local mode, which always sets the reducer count to 1.
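For reference, a quick sketch of how to check this from the Hive CLI (issuing set with no value prints the current setting; hive.exec.mode.local.auto is the property that governs automatic local mode):

-- Sketch: print the settings that determine Hive's reducer count.
set mapred.reduce.tasks;
set hive.exec.mode.local.auto;
-- With a tiny input and local mode enabled, Hive effectively runs a single
-- reducer, so SORT BY output happens to come back totally sorted.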
Re: Spark-SQL 1.2.0 "sort by" results are not consistent with Hive
Posted by Cheng Lian <li...@gmail.com>.
Could you check the Spark web UI for the number of tasks issued when the
query is executed? I dug out mapred.map.tasks because I saw 2 tasks being
issued.
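If the web UI is inconvenient, the query plan is another way to see what happens (a sketch; EXPLAIN is supported in Spark SQL, though the plan text varies by version):

-- Sketch: SORT BY should plan no Exchange (shuffle); each input partition
-- is sorted independently, so the task count equals the number of splits.
explain select name, age from testSortBy sort by age;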
Re: Spark-SQL 1.2.0 "sort by" results are not consistent with Hive
Posted by Kannan Rajah <kr...@maprtech.com>.
Cheng, we tried this setting and it still did not help. This was on Spark
1.2.0.
--
Kannan
Re: Spark-SQL 1.2.0 "sort by" results are not consistent with Hive
Posted by Cheng Lian <li...@gmail.com>.
(Move to user list.)
Hi Kannan,
You need to set mapred.map.tasks to 1 in hive-site.xml. The reason is this
line of code
(https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala#L68),
which overrides spark.default.parallelism. Also, spark.sql.shuffle.partitions
isn't used here since there's no shuffle involved (we only need to sort
within each partition).
The default value of mapred.map.tasks is 2
(https://hadoop.apache.org/docs/r1.0.4/mapred-default.html). You can see that
the Spark SQL result splits into two sorted runs at the middle.
Cheng
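A minimal sketch of the suggested fix (Cheng Lian recommends hive-site.xml; whether a session-level set also takes effect on 1.2.0 is an assumption):

-- Sketch: force the Hive table to be read as a single split, so SORT BY
-- sorts the one resulting partition as a whole.
set mapred.map.tasks=1;
select name, age from testSortBy sort by age;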