Posted to user@spark.apache.org by Cesar Flores <ce...@gmail.com> on 2015/07/02 16:44:46 UTC

Dataframe in single partition after sorting?

I am sorting a data frame using something like:

val sortedDF = df.orderBy(df("score").desc)

The sorting is really fast. The issue I have is that after sorting, the
resulting data frame sortedDF appears to be in a single partition. This is
a problem because when I try to execute another operation on this new data
frame (e.g., sortedDF.limit(1000000)), I get an error like the following:

Job aborted due to stage failure: Total size of serialized results of 194
tasks (5.0 GB) is bigger than spark.driver.maxResultSize (5.0 GB)

I have already tried to repartition the resulting sortedDF before doing any
operation on it, but the same error appears.
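For reference, here is a minimal sketch of the sequence described above,
written against the Spark 1.3 DataFrame API. The Record case class, the
local master, the partition count of 8, and the small stand-in data set are
assumptions made up for illustration; only the "score" column, orderBy, and
limit come from my actual job.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type; only the "score" column comes from the real job.
case class Record(id: Int, score: Double)

object SortPartitionCheck {
  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a quick reproduction.
    val conf = new SparkConf().setAppName("sort-partition-check").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Small stand-in data set (assumption); the real df is far larger.
    val df = sc.parallelize(1 to 1000, 8)
      .map(i => Record(i, (i % 97).toDouble))
      .toDF()

    // The sort from the post.
    val sortedDF = df.orderBy(df("score").desc)

    // Inspect how many partitions back the sorted data frame.
    println("partitions after sort: " + sortedDF.rdd.partitions.length)

    // The repartition attempt. Note that repartition shuffles rows,
    // so the global ordering produced by orderBy is not kept.
    val repartitioned = sortedDF.repartition(8)
    println("partitions after repartition: " + repartitioned.rdd.partitions.length)

    // The follow-up operation that fails on the full-size data.
    println("rows after limit: " + repartitioned.limit(100).count())

    sc.stop()
  }
}
```

Printing rdd.partitions.length is how the partition count can be checked
directly instead of inferring it from the task count in the error message.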

*Is there any smarter way to use dataframe orderBy on Spark, such that I do
not have this problem?*


The version of Spark I am using is 1.3.0, and due to company policy I
cannot try a newer version.



Thanks!!!
-- 
Cesar Flores