Posted to issues@spark.apache.org by "George George (Jira)" <ji...@apache.org> on 2020/05/04 08:23:00 UTC

[jira] [Created] (SPARK-31635) Spark SQL Sort fails when sorting big data points

George George created SPARK-31635:
-------------------------------------

             Summary: Spark SQL Sort fails when sorting big data points
                 Key: SPARK-31635
                 URL: https://issues.apache.org/jira/browse/SPARK-31635
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.3.2
            Reporter: George George


 

 Please have a look at the example below: 
{code:scala}
// 100 rows, each carrying a Seq of 250,000 Points (2 doubles per Point, roughly 4 MB of data per row)
case class Point(x: Double, y: Double)
case class Nested(a: Long, b: Seq[Point])
val test = spark.sparkContext.parallelize((1L to 100L).map(a => Nested(a, Seq.fill[Point](250000)(Point(1, 2)))), 100)

// Sort with the DataFrame/Dataset API and take the first row
test.toDF().as[Nested].sort("a").take(1)
{code}
*Sorting* rows that carry large objects with the Spark DataFrame API fails with the following exception:
{code:java}
2020-05-04 08:01:00 ERROR TaskSetManager:70 - Total size of serialized results of 14 tasks (107.8 MB) is bigger than spark.driver.maxResultSize (100.0 MB)
[Stage 0:======>                                                 (12 + 3) / 100]org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 13 tasks (100.1 MB) is bigger than spark.driver.maxResu
{code}
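One way to narrow this down (a sketch, not a confirmed diagnosis for this report) is to print the physical plan of the DataFrame query. If the sort followed by take(1) is planned as a TakeOrderedAndProject, every task sends its local top row back to the driver, and with rows of several MB each the combined task results can exceed spark.driver.maxResultSize on their own:
{code:scala}
// Reuses the `test` RDD and the case classes from the snippet above.
// explain(true) prints the parsed, analyzed, optimized and physical plans.
val ds = test.toDF().as[Nested]
ds.sort("a").limit(1).explain(true)
{code}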
However, the same sort using the *RDD API* works and no exception is thrown:

 
{code:scala}
// Same data as above, sorted with the RDD API instead of the DataFrame API
case class Point(x: Double, y: Double)
case class Nested(a: Long, b: Seq[Point])
val test = spark.sparkContext.parallelize((1L to 100L).map(a => Nested(a, Seq.fill[Point](250000)(Point(1, 2)))), 100)
test.sortBy(_.a).take(1)
{code}
For both code snippets we started the Spark shell with exactly the same arguments:

{code:bash}
spark-shell --driver-memory 6G --conf "spark.driver.maxResultSize=100MB"
{code}
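For completeness, the same two settings can also be put into spark-defaults.conf instead of being passed on the command line (a minimal sketch, assuming the default $SPARK_HOME/conf location):
{code}
# $SPARK_HOME/conf/spark-defaults.conf -- equivalent to the spark-shell flags above
spark.driver.memory        6g
spark.driver.maxResultSize 100m
{code}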
 

 

Even if we increase spark.driver.maxResultSize, the executors still get killed. The interesting thing is that the problem does not occur when using the RDD API directly. *Is this a bug in the DataFrame sort, shuffling too much data back to the driver?*

Note that this is a small example where spark.driver.maxResultSize was deliberately reduced; in our real application I have tried setting it to 8 GB, and as mentioned above the job was still killed.
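If the large `b` column is not actually needed on the driver, one possible workaround (only a sketch under that assumption, reusing the `test` RDD from above, not a fix for the underlying behaviour) is to sort and take just the small key column and then fetch the single matching row:
{code:scala}
import org.apache.spark.sql.functions.col

// Sort/take only the 8-byte key column so every task result stays tiny...
val df = test.toDF()
val minA = df.select("a").sort("a").take(1).head.getLong(0)
// ...then bring just the one full row for that key back to the driver.
val first = df.filter(col("a") === minA).as[Nested].take(1)
{code}
With this approach only a single full row is ever serialized back to the driver, so the sort itself no longer runs into spark.driver.maxResultSize.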

 


