Posted to issues@spark.apache.org by "Alex Baretta (JIRA)" <ji...@apache.org> on 2015/01/19 08:08:34 UTC

[jira] [Commented] (SPARK-5314) java.lang.OutOfMemoryError in SparkSQL with GROUP BY

    [ https://issues.apache.org/jira/browse/SPARK-5314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14282213#comment-14282213 ] 

Alex Baretta commented on SPARK-5314:
-------------------------------------

Per Akhil's comment on the dev list, "SET spark.sql.shuffle.partitions=1024" works around the OOM: with more shuffle partitions, each reduce task has to aggregate fewer rows in memory at once. I wonder whether a more robust solution could be found.
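For anyone hitting the same error, the setting can be applied either as a SQL statement or through the SQLContext before the query runs. A minimal sketch against the Spark 1.x API (here `sc` is the usual SparkContext, as in spark-shell; 1024 is simply the value that worked in this case, not a universal answer):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Option 1: issue the setting as a SQL statement, as suggested on the dev list.
    sqlContext.sql("SET spark.sql.shuffle.partitions=1024")

    // Option 2: set the same property programmatically; equivalent effect.
    sqlContext.setConf("spark.sql.shuffle.partitions", "1024")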

> java.lang.OutOfMemoryError in SparkSQL with GROUP BY
> ----------------------------------------------------
>
>                 Key: SPARK-5314
>                 URL: https://issues.apache.org/jira/browse/SPARK-5314
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Alex Baretta
>
> I am running a SparkSQL GROUP BY query on a largish Parquet table (a few hundred million rows), weighing in at about 50 GB. My cluster has 1.7 TB of RAM, so it should have more than enough resources to cope with this query (a sketch of the query's shape follows the stack trace below). Nevertheless, tasks fail with:
> WARN TaskSetManager: Lost task 279.0 in stage 22.0 (TID 1229, ds-model-w-21.c.eastern-gravity-771.internal): java.lang.OutOfMemoryError: GC overhead limit exceeded
>         at scala.collection.SeqLike$class.distinct(SeqLike.scala:493)
>         at scala.collection.AbstractSeq.distinct(Seq.scala:40)
>         at org.apache.spark.sql.catalyst.expressions.Coalesce.resolved$lzycompute(nullFunctions.scala:33)
>         at org.apache.spark.sql.catalyst.expressions.Coalesce.resolved(nullFunctions.scala:33)
>         at org.apache.spark.sql.catalyst.expressions.Coalesce.dataType(nullFunctions.scala:37)
>         at org.apache.spark.sql.catalyst.expressions.Expression.n2(Expression.scala:100)
>         at org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:101)
>         at org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:50)
>         at org.apache.spark.sql.catalyst.expressions.MutableLiteral.update(literals.scala:81)
>         at org.apache.spark.sql.catalyst.expressions.SumFunction.update(aggregates.scala:571)
>         at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:167)
>         at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
>         at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
>         at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>         at org.apache.spark.scheduler.Task.run(Task.scala:56)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
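
For reference, the failing workload described above has roughly this shape (a hypothetical reconstruction against the Spark 1.2-era API; the actual table path, column names, and aggregates were not posted):

    // Hypothetical sketch: read a ~50 GB Parquet table and aggregate with GROUP BY.
    val table = sqlContext.parquetFile("/path/to/parquet")  // placeholder path
    table.registerTempTable("events")

    // The GROUP BY forces a shuffle; the reduce side is split into
    // spark.sql.shuffle.partitions tasks (default 200). With too few
    // partitions, each task aggregates too many rows in memory, which
    // matches the "GC overhead limit exceeded" failure in the trace above.
    val result = sqlContext.sql(
      "SELECT key, SUM(value) AS total FROM events GROUP BY key")
    result.collect()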



