Posted to issues@spark.apache.org by "Ran Haim (JIRA)" <ji...@apache.org> on 2016/11/13 15:22:58 UTC

[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting

    [ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15661640#comment-15661640 ] 

Ran Haim commented on SPARK-17436:
----------------------------------

Hi,
I only got a chance to work on this now.
I saw that the whole class tree has changed, so I made my change in org.apache.spark.sql.execution.datasources.FileFormatWriter.
The problem is that I cannot get mvn clean install to run successfully. A lot of tests fail; the failures are not related to my change and happen without it. I also want to make sure there are relevant tests for this, though I did not find any.

Any ideas?

Ran

> dataframe.write sometimes does not keep sorting
> -----------------------------------------------
>
>                 Key: SPARK-17436
>                 URL: https://issues.apache.org/jira/browse/SPARK-17436
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Ran Haim
>
> When using partitionBy, the data writer can sometimes mess up the ordering of a sorted dataframe.
> The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are open (the limit is configurable), it starts inserting rows into an UnsafeKVExternalSorter, then reads all the rows back from the sorter and writes them to the corresponding files.
> The problem is that the sorter sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort, if you will).
> I think the best way to fix it is to stop using a sorter and instead put the rows in a map, with the partition key as the key and an ArrayList of rows as the value, and then walk through all the keys and write the rows in their original order. This will probably be faster, as there is no need for sorting.
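
The map-based approach proposed in the description could look roughly like the sketch below. It is plain Python rather than Spark code, and the names group_rows_by_partition, write_partitioned, partition_key and write_row are hypothetical, introduced only for illustration. It shows the idea, not the actual change made in FileFormatWriter.

```python
from collections import OrderedDict


def group_rows_by_partition(rows, partition_key):
    """Bucket rows by partition key without sorting them.

    Rows inside each bucket stay in the order they arrived in, so an
    existing (secondary) sort order is preserved per partition.
    partition_key is a hypothetical callable mapping a row to its key.
    """
    buckets = OrderedDict()
    for row in rows:
        buckets.setdefault(partition_key(row), []).append(row)
    return buckets


def write_partitioned(rows, partition_key, write_row):
    """Walk the keys and write each bucket in its original row order.

    write_row is a hypothetical callback standing in for the per-file
    output writer; it receives the partition key and the row.
    """
    for key, bucket in group_rows_by_partition(rows, partition_key).items():
        for row in bucket:
            write_row(key, row)


if __name__ == "__main__":
    # Rows already sorted by the second field; the first field is the partition.
    rows = [("b", 1), ("a", 2), ("b", 3), ("a", 4)]
    write_partitioned(rows, lambda r: r[0], lambda k, r: print(k, r))
```

One trade-off to keep in mind: this sketch holds every row in memory, whereas an external sorter can spill to disk, so a real implementation would probably need some bound on the buffered data.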


