Posted to user@spark.apache.org by RodrigoB <ro...@aspect.com> on 2016/09/29 18:50:45 UTC

Running in local mode as SQL engine - what to optimize?

Hi all,

For several reasons which I won't elaborate on (yet), we're using Spark in
local mode as an in-memory SQL engine: we retrieve data from Cassandra,
execute SQL queries against it, and return the results to the client - so no
cluster and no worker nodes. I'm well aware that local mode has always been
considered a testing mode, but it does fit our purposes at the moment.

We're on Spark 2.0.0
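
For context, the setup looks roughly like the following (a minimal sketch,
not our actual code - the keyspace/table names and query are placeholders,
and we read through the Spark Cassandra connector):

```scala
import org.apache.spark.sql.SparkSession

// Everything - driver, executor work, data - runs in this one JVM.
val spark = SparkSession.builder()
  .appName("embedded-sql-engine")
  .master("local[*]")
  .getOrCreate()

// Pull data from Cassandra into a DataFrame (placeholder keyspace/table),
// register it, query with SQL, and collect results in the same process.
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "events"))
  .load()
df.createOrReplaceTempView("events")

val result = spark.sql("SELECT key, count(*) FROM events GROUP BY key").collect()
```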

I'm finding several challenges which I would like to get some comments if
possible:

1 - For GROUP BY based SQL queries I'm finding that shuffle disk spills
happen constantly, to the point where after a couple of days I have 9GB of
disk filled with broadcast files in the block manager folder. My
understanding is that disk spills should only occur during the lifetime of
an RDD: once the RDD is gone from memory, the files should go too, but this
doesn't seem to be happening. Is there any way to completely disable disk
spills? I've tweaked the memory fraction configuration to maximize execution
memory and avoid the spills, but it doesn't seem to have done much.
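
In case it helps narrow things down, these are the knobs I've been
experimenting with (example values in spark-defaults.conf syntax, not a
recommendation - the same keys can be set on the SparkSession builder):

```
# The default of 200 shuffle partitions is far more than ~1MB of data
# needs; every GROUP BY produces that many shuffle outputs.
spark.sql.shuffle.partitions   4

# Spark 2.0 unified memory manager: the heap fraction shared by
# execution + storage, and the share of it protected for cached storage.
spark.memory.fraction          0.8
spark.memory.storageFraction   0.2
```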

2 - GC overhead is overwhelming: when refreshing a DataFrame (even with
empty data!) and executing one GROUP BY query every second on around 1MB of
data, Young Gen usage grows by 2GB every 10 seconds. I've started profiling
the JVM and can see a considerable number of HashMap objects, which I assume
are created internally by Spark.

3 - I'm really looking for low-latency, multithreaded refreshes and
collection of DataFrames: query execution and collection of data within this
local node on the order of milliseconds. I'm afraid this goes against the
nature of Spark: it partitions all data as blocks and uses the scheduler for
its multi-node design, which is great for multi-node execution but adds
considerable overhead on a single local node. I'm aware of this constraint;
the hope is that we could optimize things to the point where this kind of
usage becomes a possibility - an efficient in-memory SQL engine within the
same JVM where the data lives. Any suggestions are very welcome!
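
Concretely, the access pattern we're aiming for is sketched below.
`runQuery` is a hypothetical stand-in for `spark.sql(...).collect()` against
a shared SparkSession (Spark's scheduler supports submitting jobs from
multiple threads within one application); the thread count and query shapes
are just examples:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object ConcurrentQueries {
  // Fixed pool driving concurrent query submissions into the local JVM.
  private val pool = Executors.newFixedThreadPool(8)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)

  // Placeholder: in the real service this would be
  // spark.sql(sql).collect() on a shared SparkSession.
  def runQuery(sql: String): Array[Long] = Array.empty

  def main(args: Array[String]): Unit = {
    // Many small GROUP BY queries in flight at once; each should ideally
    // finish in milliseconds, since the data lives in this JVM's heap.
    val queries =
      (1 to 16).map(i => s"SELECT key, count(*) FROM events_$i GROUP BY key")
    val results = Future.traverse(queries)(q => Future(runQuery(q)))
    Await.result(results, 10.seconds)
    pool.shutdown()
  }
}
```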

Thanks in advance,
Rod


--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Running-in-local-mode-as-SQL-engine-what-to-optimize-tp27815.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org