Posted to user@spark.apache.org by SLiZn Liu <sl...@gmail.com> on 2015/03/25 09:48:19 UTC

OutOfMemoryError when using DataFrame created by Spark SQL

Hi,

I am using *Spark SQL* to query my *Hive cluster*, following the Spark SQL
and DataFrame Guide
<https://spark.apache.org/docs/latest/sql-programming-guide.html> step by
step. However, my HiveQL submitted via sqlContext.sql() fails with
java.lang.OutOfMemoryError, even though the expected result of the query
should be small (I added a limit 1000 clause). My code is shown below:

scala> import sqlContext.implicits._
scala> val df = sqlContext.sql(
     |   """select * from some_table where logdate = "2015-03-24" limit 1000""")

and the error msg:

[ERROR] [03/25/2015 16:08:22.379] [sparkDriver-scheduler-27]
[ActorSystem(sparkDriver)] Uncaught fatal error from thread
[sparkDriver-scheduler-27] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: GC overhead limit exceeded
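
For reference, the same query can also be expressed through the DataFrame
API rather than raw HiveQL (a sketch only, assuming the same table and
column names as above and a Spark 1.3-era DataFrame API; the $"col" syntax
comes from the sqlContext.implicits._ import):

scala> // equivalent DataFrame-style query with an explicit row limit
scala> val df2 = sqlContext.table("some_table")
     |   .filter($"logdate" === "2015-03-24")
     |   .limit(1000)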

The master's heap is set with -Xms512m -Xmx512m, while the workers are set
with -Xms4096M -Xmx4096M, which I presume is sufficient for this trivial
query.
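
To double-check what the driver actually picked up, its SparkConf can be
queried from inside the shell (a sketch; these keys are only present when
memory is configured through Spark's own options such as --driver-memory,
not when raw -Xms/-Xmx flags are passed directly to the JVM):

scala> // returns Some("...") if set via Spark options, None otherwise
scala> sc.getConf.getOption("spark.driver.memory")
scala> sc.getConf.getOption("spark.executor.memory")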

Additionally, after restarting the spark-shell and re-running the query
with limit 5, the df object is returned and can be printed by df.show(),
but other APIs fail with OutOfMemoryError, namely df.count(),
df.select("some_field").show(), and so forth.

I understand that an RDD can be collected to the master so that further
transformations can be applied, but since DataFrames advertise "richer
optimizations under the hood" and a familiar convention for an R/Julia
user, I really hope this error can be tackled and that DataFrame proves
robust enough to depend on.

Thanks in advance!

REGARDS,
Todd