Posted to issues@spark.apache.org by "Ilya Ganelin (JIRA)" <ji...@apache.org> on 2015/07/15 20:17:05 UTC

[jira] [Commented] (SPARK-8890) Reduce memory consumption for dynamic partition insert

    [ https://issues.apache.org/jira/browse/SPARK-8890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628488#comment-14628488 ] 

Ilya Ganelin commented on SPARK-8890:
-------------------------------------

[~rxin] I want to make sure I correctly understand your solution. Are you proposing that, once the number of active partitions exceeds 50, we repartition the data into 50 partitions?
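
Or, if the intent is to literally inject a sort on the partition key, I picture it roughly like the sketch below (simplified stand-in types, not the actual InsertIntoHadoopFsRelation code), so that each partition's rows arrive contiguously and at most one writer is open at a time:

{code:scala}
// Simplified stand-ins; not the actual InsertIntoHadoopFsRelation types.
case class Row(partitionKey: String, value: String)

trait OutputWriter {
  def write(row: Row): Unit
  def close(): Unit
}

object SortedWriteSketch {
  def writeSorted(rows: Seq[Row], newWriter: String => OutputWriter): Unit = {
    // Sort so that each partition's rows are contiguous; then only one
    // writer ever needs to be open at a time.
    val sorted = rows.sortBy(_.partitionKey)
    var currentKey: Option[String] = None
    var writer: Option[OutputWriter] = None
    for (row <- sorted) {
      if (!currentKey.contains(row.partitionKey)) {
        writer.foreach(_.close())              // finished with the previous partition
        writer = Some(newWriter(row.partitionKey))
        currentKey = Some(row.partitionKey)
      }
      writer.get.write(row)
    }
    writer.foreach(_.close())
  }
}
{code}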

I think we could approach this differently by creating a pool of OutputWriters (of size 50) and only creating a new OutputWriter once a previous partition has been fully written and its writer closed. This could be handled by blocking within the outputWriterForRow call when a new OutputWriter has to be created; a rough sketch follows below.
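
Roughly what I have in mind (illustrative names only, not the real outputWriterForRow; and instead of literally blocking, this single-threaded sketch closes the oldest open writer when the cap is hit, since within one task there is no second thread to wait on):

{code:scala}
import scala.collection.mutable

// Illustrative sketch of a bounded writer pool; not the actual Spark code.
// Reuses the OutputWriter trait from the sketch above.
class BoundedWriterPool(maxOpenWriters: Int, newWriter: String => OutputWriter) {
  // LinkedHashMap preserves insertion order, so `head` is the oldest open writer.
  private val open = mutable.LinkedHashMap.empty[String, OutputWriter]

  def writerFor(partitionKey: String): OutputWriter =
    open.getOrElse(partitionKey, {
      if (open.size >= maxOpenWriters) {
        // Cap reached: close the oldest writer before opening a new one.
        // (Re-opening that partition later for append is glossed over here.)
        val (oldestKey, oldestWriter) = open.head
        oldestWriter.close()
        open.remove(oldestKey)
      }
      val created = newWriter(partitionKey)
      open(partitionKey) = created
      created
    })

  def closeAll(): Unit = {
    open.values.foreach(_.close())
    open.clear()
  }
}
{code}

With randomly ordered input this could churn writers (close and re-open the same partition repeatedly), which is presumably where the sorting in your proposal helps.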

Does that seem reasonable? Please let me know, thanks!

> Reduce memory consumption for dynamic partition insert
> ------------------------------------------------------
>
>                 Key: SPARK-8890
>                 URL: https://issues.apache.org/jira/browse/SPARK-8890
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Reynold Xin
>            Priority: Critical
>
> Currently, InsertIntoHadoopFsRelation can run out of memory if the number of table partitions is large. The problem is that we open one output writer per partition, so when the data arrive in random order and the number of partitions is large, we end up with a large number of output writers open at once, leading to OOM.
> The solution here is to inject a sorting operation once the number of active partitions goes beyond a certain threshold (e.g. 50?).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org