Posted to issues@spark.apache.org by "George George (Jira)" <ji...@apache.org> on 2020/05/04 08:23:00 UTC
[jira] [Created] (SPARK-31635) Spark SQL Sort fails when sorting big data points
George George created SPARK-31635:
-------------------------------------
Summary: Spark SQL Sort fails when sorting big data points
Key: SPARK-31635
URL: https://issues.apache.org/jira/browse/SPARK-31635
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.3.2
Reporter: George George
Please have a look at the example below:
{code:java}
case class Point(x:Double, y:Double)
case class Nested(a: Long, b: Seq[Point])
val test = spark.sparkContext.parallelize((1L to 100L).map(a => Nested(a,Seq.fill[Point](250000)(Point(1,2)))), 100)
test.toDF().as[Nested].sort("a").take(1)
{code}
*Sorting* big data objects using the Spark DataFrame API fails with the following exception:
{code:java}
2020-05-04 08:01:00 ERROR TaskSetManager:70 - Total size of serialized results of 14 tasks (107.8 MB) is bigger than spark.driver.maxResultSize (100.0 MB)
[Stage 0:======> (12 + 3) / 100]org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 13 tasks (100.1 MB) is bigger than spark.driver.maxResu
{code}
However, using the *RDD API* works and no exception is thrown:
{code:java}
case class Point(x:Double, y:Double)
case class Nested(a: Long, b: Seq[Point])
val test = spark.sparkContext.parallelize((1L to 100L).map(a => Nested(a,Seq.fill[Point](250000)(Point(1,2)))), 100)
test.sortBy(_.a).take(1)
{code}
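As a hedged workaround sketch (untested against 2.3.2): expressing the limit inside the DataFrame query, rather than calling take() on the driver, should let Spark plan a TakeOrderedAndProject, which keeps only the top row per partition instead of returning full sorted partitions to the driver. If this avoids the error while take(1) does not, that would support the suspicion of a planning bug:
{code:java}
case class Point(x: Double, y: Double)
case class Nested(a: Long, b: Seq[Point])

val test = spark.sparkContext.parallelize(
  (1L to 100L).map(a => Nested(a, Seq.fill[Point](250000)(Point(1, 2)))), 100)

// limit(1) is part of the query plan, so only one row per partition
// needs to be shipped to the driver for the final ordering step.
test.toDF().as[Nested].sort("a").limit(1).collect()
{code}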
For both code snippets we started the Spark shell with exactly the same arguments:
{code:java}
spark-shell --driver-memory 6G --conf "spark.driver.maxResultSize=100MB"
{code}
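For completeness, the Spark configuration documentation states that setting spark.driver.maxResultSize to 0 removes the limit entirely. This only masks the symptom (and risks driver out-of-memory errors), but it can confirm whether the size check is the sole failure point:
{code:java}
spark-shell --driver-memory 6G --conf "spark.driver.maxResultSize=0"
{code}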
Even if we increase spark.driver.maxResultSize, the executors still get killed. Interestingly, the problem does not occur when using the RDD API directly. *Could this be a bug in the DataFrame sort, since it seems to shuffle too much data to the driver?*
Note that this is a small example in which I reduced spark.driver.maxResultSize to provoke the failure quickly; in our real application I tried setting it to 8 GB, but as mentioned above the job was still killed.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org