Posted to issues@spark.apache.org by "Dooyoung Hwang (JIRA)" <ji...@apache.org> on 2018/08/24 10:44:00 UTC

[jira] [Created] (SPARK-25224) Improvement of Spark SQL ThriftServer memory management

Dooyoung Hwang created SPARK-25224:
--------------------------------------

             Summary: Improvement of Spark SQL ThriftServer memory management
                 Key: SPARK-25224
                 URL: https://issues.apache.org/jira/browse/SPARK-25224
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.3.1
            Reporter: Dooyoung Hwang


Spark SQL has only two options for managing Thrift Server memory: enabling spark.sql.thriftServer.incrementalCollect or not. (The two modes are contrasted in a short sketch after the two cases below.)
 # *The case of enabling spark.sql.thriftServer.incrementalCollect*
*1) Pros*
 - The Thrift Server can handle large outputs without OOM.

*2) Cons*
 - Performance degrades because partitions are executed one at a time.
 - Queries with a count limit are handled inefficiently because all partitions are executed. (By contrast, executeTake stops scanning once the count limit has been collected.)
 - The result cannot be cached for FETCH_FIRST.


 # *The case of disabling spark.sql.thriftServer.incrementalCollect*
*1) Pros*
 - Good performance for small outputs.

*2) Cons*
 - Peak memory usage is very high because decompressed & deserialized rows are allocated in a "batch" manner, so an OOM can occur for large outputs.
 - It is difficult to estimate the peak memory usage of a query, which makes configuring spark.driver.maxResultSize very difficult.
 - If the decompressed & deserialized rows fill up the Eden area of the JVM heap, they are promoted to the old generation, increasing the likelihood of a stop-the-world "Full GC".
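
For illustration, a minimal Scala sketch (not the actual Thrift Server code) contrasting the two modes via the public Dataset API. The SparkSession and the table name t are assumptions; the flag itself would be set at server start, e.g. with --conf spark.sql.thriftServer.incrementalCollect=true:

{code:scala}
import org.apache.spark.sql.{Row, SparkSession}

object CollectModes {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("collect-modes").getOrCreate()
    val df = spark.sql("SELECT * FROM t") // 't' is an assumed table

    // Batch mode (incrementalCollect disabled): every row is decompressed,
    // deserialized and held on the driver at once. Fast for small results,
    // but peak driver memory grows with the full result size.
    val batch: Array[Row] = df.collect()

    // Incremental mode (incrementalCollect enabled): partitions are fetched
    // one at a time, so driver memory stays bounded, but all partitions are
    // executed even when the client only needs the first N rows
    // (whereas a LIMIT served by executeTake stops scanning early).
    val iter: java.util.Iterator[Row] = df.toLocalIterator()
    while (iter.hasNext) {
      val row = iter.next() // stream each row to the client
    }

    spark.stop()
  }
}
{code}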

 

The improvement idea is as follows:
 # *The Dataset does not decompress & deserialize the result; it just returns the total row count & an iterator to the SQL executor.* By doing so, only compressed data resides in memory, so memory usage is not only much lower than before but also controllable via spark.driver.maxResultSize.
 # *After the SQL executor gets the total row count & iterator from the Dataset, it can use the returned row count to decide whether to collect the rows in a batch manner (appropriate for a small row count) or to deserialize and send them iteratively (appropriate for a large row count).* A hypothetical sketch of this two-phase flow follows the list.
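
A minimal sketch of the proposed two-phase flow. All names here (collectCountAndIterator, sendBatch, sendRow, batchThreshold) are hypothetical stubs for illustration, not existing Spark APIs, and the count is obtained with a separate count() job, whereas the proposal would derive it from the same collection:

{code:scala}
import org.apache.spark.sql.{Dataset, Row}
import scala.collection.JavaConverters._

object ProposedFlow {
  // Hypothetical threshold below which batch collection is considered safe.
  val batchThreshold = 10000L

  // Hypothetical helper: the proposal would run one job that leaves the
  // fetched blocks compressed/serialized and reports the total row count.
  // Stubbed here with count() + toLocalIterator(), which only approximates that.
  def collectCountAndIterator(df: Dataset[Row]): (Long, Iterator[Row]) =
    (df.count(), df.toLocalIterator().asScala)

  // Hypothetical transport stubs standing in for the Thrift Server response path.
  def sendBatch(rows: Array[Row]): Unit = println(s"batch of ${rows.length} rows")
  def sendRow(row: Row): Unit = println(row)

  def serveResult(df: Dataset[Row]): Unit = {
    val (totalRows, rows) = collectCountAndIterator(df)
    if (totalRows <= batchThreshold) {
      // Small result: materialize everything at once (also makes FETCH_FIRST
      // re-reads cheap, since the batch can be cached).
      sendBatch(rows.toArray)
    } else {
      // Large result: deserialize and ship rows one by one so peak driver
      // memory stays close to the size of the compressed result.
      rows.foreach(sendRow)
    }
  }
}
{code}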


