Posted to issues@spark.apache.org by "Dooyoung Hwang (JIRA)" <ji...@apache.org> on 2018/08/24 10:48:00 UTC

[jira] [Updated] (SPARK-25224) Improvement of Spark SQL ThriftServer memory management

     [ https://issues.apache.org/jira/browse/SPARK-25224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dooyoung Hwang updated SPARK-25224:
-----------------------------------
    Description: 
Spark SQL has only two options for managing Thrift Server memory: enabling spark.sql.thriftServer.incrementalCollect or not.

*1. The case of enabling spark.sql.thriftServer.incrementalCollect*
*1) Pros:* The Thrift Server can handle large output without OOM.

*2) Cons* (see the sketch after this list)
 * Performance degrades because tasks are executed partition by partition.
 * Queries with a count limit are handled inefficiently because all partitions are executed, whereas executeTake stops scanning once the count limit has been collected.
 * The result cannot be cached for FETCH_FIRST.
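For reference, the incremental path behaves much like draining Dataset.toLocalIterator, which runs the partitions one job at a time so the driver only ever holds one partition's rows. A minimal sketch under that assumption (the object name is illustrative, and the incrementalCollect setting only takes effect inside the Thrift Server itself):

{code:scala}
import org.apache.spark.sql.SparkSession

object IncrementalCollectSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("incremental-collect-sketch")
      .master("local[2]")
      // Read by the Thrift Server, not by a plain app; shown for reference.
      .config("spark.sql.thriftServer.incrementalCollect", "true")
      .getOrCreate()

    val df = spark.range(0L, 1000000L).toDF("id")

    // toLocalIterator() schedules one job per partition, so driver memory
    // stays bounded, but a count-limited query still ends up touching
    // every partition, unlike executeTake, which stops scanning early.
    val it = df.toLocalIterator()
    var fetched = 0L
    while (it.hasNext && fetched < 10) {
      println(it.next())
      fetched += 1
    }

    spark.stop()
  }
}
{code}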

*2. The case of disabling spark.sql.thriftServer.incrementalCollect*

*1) Pros:* Good performance for small output.

*2) Cons* (see the sketch after this list)
 * Peak memory usage is very high because the decompressed & deserialized rows are allocated in a "batch" manner, so OOM can occur for large output.
 * It is difficult to estimate the peak memory usage of a query, so configuring spark.driver.maxResultSize appropriately is very difficult.
 * If the decompressed & deserialized rows fill up the Eden area of the JVM heap, they are promoted to the old generation, which increases the likelihood of a stop-the-world "Full GC".
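A minimal sketch of the batch path, and of why spark.driver.maxResultSize is hard to tune for it: the limit is enforced against the serialized task results, while the driver actually holds the much larger decompressed & deserialized Row objects. Object name and data sizes here are illustrative:

{code:scala}
import org.apache.spark.sql.SparkSession

object BatchCollectSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("batch-collect-sketch")
      .master("local[2]")
      // Caps only the serialized result size, not the heap actually
      // consumed by deserialized Row objects on the driver.
      .config("spark.driver.maxResultSize", "1g")
      .getOrCreate()

    // ~100 bytes of payload per row; once deserialized, each Row takes
    // considerably more heap than its serialized, compressed form.
    val df = spark.range(0L, 1000000L)
      .selectExpr("id", "repeat('x', 100) AS payload")

    // collect() decompresses & deserializes every row at once. The burst
    // of short-lived objects can overflow Eden into the old generation,
    // inviting stop-the-world Full GCs even below the maxResultSize cap.
    val rows = df.collect()
    println(s"collected ${rows.length} rows")

    spark.stop()
  }
}
{code}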

 

The improvement idea is below (a hypothetical sketch follows the list):
 # *Dataset does not decompress & deserialize the result; it just returns the total row count & an iterator to the SQL executor.* By doing that, only compressed, serialized data resides in memory, so the memory usage is not only much lower than before but also controllable via spark.driver.maxResultSize.
 # *After the SQL executor gets the total row count & iterator from the Dataset, it can decide, based on the returned row count, whether to collect the rows in a batch manner (appropriate for a small row count) or to deserialize and send them iteratively (appropriate for a large row count).*
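A hypothetical sketch of that flow; none of these names (CollectedResult, ResultSender, BatchThreshold) exist in Spark, and the cutoff value is purely illustrative:

{code:scala}
import org.apache.spark.sql.Row

// Assumed result shape: total count plus a lazy iterator whose backing
// batches stay compressed until rows are actually pulled (hypothetical).
final case class CollectedResult(totalRowCount: Long, rows: Iterator[Row])

object ResultSender {
  // Illustrative cutoff; not an existing Spark configuration.
  val BatchThreshold = 10000L

  def send(result: CollectedResult, emit: Row => Unit): Unit = {
    if (result.totalRowCount <= BatchThreshold) {
      // Small result: materialize everything in one batch, as the
      // non-incremental path does today.
      result.rows.toArray.foreach(emit)
    } else {
      // Large result: decode and forward row by row, so only one
      // compressed batch is decompressed & deserialized at a time.
      result.rows.foreach(emit)
    }
  }
}
{code}

Either way, the SQL executor makes the batch-vs-iterative decision after the query has run, using the actual row count rather than a static configuration.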


> Improvement of Spark SQL ThriftServer memory management
> -------------------------------------------------------
>
>                 Key: SPARK-25224
>                 URL: https://issues.apache.org/jira/browse/SPARK-25224
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Dooyoung Hwang
>            Priority: Major
>


