Posted to issues@spark.apache.org by "Reynold Xin (JIRA)" <ji...@apache.org> on 2016/11/02 05:57:58 UTC

[jira] [Comment Edited] (SPARK-15420) Repartition and sort before Parquet writes

    [ https://issues.apache.org/jira/browse/SPARK-15420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15627859#comment-15627859 ] 

Reynold Xin edited comment on SPARK-15420 at 11/2/16 5:57 AM:
--------------------------------------------------------------

Ryan, I looked at this just now (sorry for not looking earlier). I recently refactored this part of the code, so I have a pretty good idea of what's going on here. I think a simpler solution is to move the sort out of the writer, and then use the planner to inject the sorts and exchanges necessary. It will be more obvious in the explain plan, and it also lets us reuse existing operator code. WDYT?

Also, why do we need DataFrameWriter to expose partitioning? Couldn't that be done via the DataFrame.repartition function itself, before calling the writer?
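
For concreteness, here is a minimal sketch of that alternative, assuming hypothetical column names and paths: the caller does the shuffle and the per-partition sort on the DataFrame itself, and the writer only writes.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("repartition-before-write").getOrCreate()
val df = spark.read.parquet("/path/to/input")  // hypothetical input

// One explicit shuffle keyed by the partition columns, then a per-task sort,
// so each task sees each output partition's rows contiguously and the writer
// only needs one file open at a time:
df.repartition(col("year"), col("month"))
  .sortWithinPartitions("year", "month")
  .write
  .partitionBy("year", "month")
  .parquet("/path/to/output")
{code}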



> Repartition and sort before Parquet writes
> ------------------------------------------
>
>                 Key: SPARK-15420
>                 URL: https://issues.apache.org/jira/browse/SPARK-15420
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1
>            Reporter: Ryan Blue
>
> Parquet requires buffering data in memory before writing a group of rows organized by column. This causes significant memory pressure when writing partitioned output because each open file must buffer rows.
> Currently, Spark will sort data, spilling if necessary, in the {{WriterContainer}} to avoid keeping many files open at once. But this isn't a full solution, for a few reasons:
> * The final sort is always performed, even if the incoming data is already sorted correctly. For example, a global sort will cause two sorts to happen, even though the global sort already prepares the data correctly (sketched below).
> * To prevent a large number of small output files, users must manually add a repartition step. That step is also ignored by the sort within the writer.
> * Hive does not currently support {{DataFrameWriter#sortBy}}.
> The sort in {{WriterContainer}} makes sense as a safeguard, but it should detect whether the incoming data is already sorted. The {{DataFrameWriter}} should also expose the ability to repartition data before the write stage, and the query planner should expose an option to insert repartition operations automatically.
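
To make the first bullet above concrete, here is a hedged sketch of the redundant-sort case (the column name and paths are hypothetical):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("double-sort-example").getOrCreate()
val df = spark.read.parquet("/path/to/events")  // hypothetical input

// The global sort already clusters rows by the partition column, but the
// sort inside WriterContainer still runs a second time during the write:
df.orderBy("day")
  .write
  .partitionBy("day")
  .parquet("/path/to/events-by-day")
{code}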



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org