Posted to issues@spark.apache.org by "Michael Armbrust (JIRA)" <ji...@apache.org> on 2015/05/20 21:06:59 UTC

[jira] [Resolved] (SPARK-7564) performance bottleneck in SparkSQL using columnar storage

     [ https://issues.apache.org/jira/browse/SPARK-7564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust resolved SPARK-7564.
-------------------------------------
    Resolution: Duplicate

> performance bottleneck in SparkSQL using columnar storage
> ---------------------------------------------------------
>
>                 Key: SPARK-7564
>                 URL: https://issues.apache.org/jira/browse/SPARK-7564
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.1, 1.3.1
>         Environment: 3 node cluster, each with 100g RAM and 40 cores
>            Reporter: Noam Barkai
>         Attachments: worker profiling showing the bottle-neck.png
>
>
> A query over a table that is fully cached in memory, where the data came from columnar storage, is surprisingly slow. The query is a simple SELECT over a 10 GB table that sits comfortably in memory (the Storage tab in the Spark UI confirms this). All operations are in memory; no shuffle takes place (again, verified via the Spark UI).
> Profiling shows that almost all worker threads are in one of two states:
> 1) either trying to acquire an instance of a Kryo serializer from the pool in SparkSqlSerializer, like so:
> java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:361)
> com.twitter.chill.ResourcePool.borrow(ResourcePool.java:35)
> org.apache.spark.sql.execution.SparkSqlSerializer$.acquireRelease(SparkSqlSerializer.scala:82)
> ...
> org.apache.spark.sql.columnar.InMemoryColumnarTableScan$$anonfun$9$$anonfun$14$$anon$2.next(InMemoryColumnarTableScan.scala:279)
> 2) or trying to release one:
> java.util.concurrent.ArrayBlockingQueue.offer(ArrayBlockingQueue.java:298)
> com.twitter.chill.ResourcePool.release(ResourcePool.java:50)
> org.apache.spark.sql.execution.SparkSqlSerializer$.acquireRelease(SparkSqlSerializer.scala:86)
> ...
> org.apache.spark.sql.columnar.InMemoryColumnarTableScan$$anonfun$9$$anonfun$14$$anon$2.next(InMemoryColumnarTableScan.scala:279)
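> To illustrate the pattern implied by these traces, here is a minimal Scala sketch; the names and details are assumptions, not the actual com.twitter.chill or Spark source. Every borrow and release funnels through a single ArrayBlockingQueue, so all worker threads contend on its one internal lock:
> // Minimal sketch of the implicated pool pattern (assumed, not the real source).
> import java.util.concurrent.ArrayBlockingQueue
>
> class SimplePool[T](size: Int)(newInstance: () => T) {
>   // One shared queue: poll() and offer() both take the same internal lock.
>   private val queue = new ArrayBlockingQueue[T](size)
>
>   def borrow(): T = Option(queue.poll()).getOrElse(newInstance())
>   def release(obj: T): Unit = queue.offer(obj)
>
>   // The acquire/release-per-call pattern seen in the traces: each next()
>   // on the scan iterator borrows a serializer and releases it again.
>   def acquireRelease[O](fn: T => O): O = {
>     val obj = borrow()
>     try fn(obj) finally release(obj)
>   }
> }
> Under a heavy parallel scan, every thread hits that one lock on both the borrow and the release, which matches the poll/offer frames above.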
> The issue appears when the cached data comes from columnar storage; I was able to reproduce it using both ORC and Parquet.
> When the data is loaded from a parallel TSV text file, the issue does not occur.
> It seems to be related to serialization/deserialization calls made via InMemoryColumnarTableScan.
> The code I'm using (run from the Spark shell):
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> hiveContext.sql("CACHE TABLE cached_tbl AS SELECT * FROM tbl1 ORDER BY col1").collect()
> hiveContext.sql("select col1, col2, col3 from cached_tbl").collect
> It seems that the use of KryoResourcePool in SparkSqlSerializer causes contention on the underlying ArrayBlockingQueue. A possible fix might be to replace this data structure with something more "multi-thread friendly".
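> For example (a hedged sketch of one possible direction, not a tested patch): a ThreadLocal serializer instance would remove the shared queue from the hot path entirely. Kryo registration details are omitted here and would need to mirror SparkSqlSerializer's actual setup.
> // Hypothetical alternative: one Kryo instance per worker thread, so no
> // shared data structure is touched when acquiring a serializer.
> import com.esotericsoftware.kryo.Kryo
>
> object ThreadLocalKryo {
>   private val local = new ThreadLocal[Kryo] {
>     override def initialValue(): Kryo = new Kryo() // plus registrations
>   }
>   def acquireRelease[O](fn: Kryo => O): O = fn(local.get())
> }
> The trade-off is that this pins one serializer per thread for the executor's lifetime instead of bounding the total count with a fixed-size pool.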


