Posted to issues@spark.apache.org by "Dongjoon Hyun (JIRA)" <ji...@apache.org> on 2018/11/23 17:57:00 UTC

[jira] [Comment Edited] (SPARK-20144) spark.read.parquet no long maintains ordering of the data

    [ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16697410#comment-16697410 ] 

Dongjoon Hyun edited comment on SPARK-20144 at 11/23/18 5:56 PM:
-----------------------------------------------------------------

Sorry, [~darabos]. IMHO, the proposed approach is not consistent with the existing Apache Spark design. It is also not robust enough to be part of Apache Spark, because it misleads users: it cannot always guarantee sorted output. Lastly, it causes performance degradation because it may try to open many small files first. If you need this behavior, I'd suggest applying your patch to your own Spark build.


was (Author: dongjoon):
Sorry, [~darabos]. IMHO, the proposed way is not consistent with the existing Apache Spark design choice. Also, it's not robust enough to be a part of Apache Spark because it misleads the user without the guarantee on sort-ness always. Lastly, it causes performance degradation because it may try open many small files first. I think you had better add your patch into your Spark build if you have.

> spark.read.parquet no long maintains ordering of the data
> ---------------------------------------------------------
>
>                 Key: SPARK-20144
>                 URL: https://issues.apache.org/jira/browse/SPARK-20144
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: Li Jin
>            Priority: Major
>
> Hi, we are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is that when we read parquet files in 2.0.2, the ordering of rows in the resulting dataframe is not the same as the ordering of rows in the dataframe that the parquet files were produced from.
> This is because FileSourceStrategy.scala combines the parquet files into fewer partitions and also reorders them. This breaks our workflows because they assume the ordering of the data.
> Is this considered a bug? Also, FileSourceStrategy and FileSourceScanExec changed quite a bit from 2.0.2 to 2.1, so we are not sure whether this is also an issue with 2.1.
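
To illustrate the reordering described above, here is a minimal sketch in plain Python of a size-based greedy packing of files into read partitions. It is a simplified model for illustration only, not Spark's actual code: the function name pack_files, the parameter names max_split_bytes and open_cost_bytes, and the file names and sizes are all hypothetical, chosen only to loosely resemble the spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes settings. It assumes files are taken largest first and that a partition is closed once the next file would overflow it.

```python
from typing import List, Tuple

# (file name, size in bytes); names and sizes are made up for illustration.
FileSplit = Tuple[str, int]


def pack_files(files: List[FileSplit],
               max_split_bytes: int,
               open_cost_bytes: int) -> List[List[str]]:
    """Greedily pack files into partitions, largest file first.

    Simplified model (an assumption, not Spark's implementation):
    a partition is closed as soon as adding the next file would
    exceed max_split_bytes, and every file added also counts an
    extra open_cost_bytes toward the partition size.
    """
    # Sorting by size is what discards the original write order.
    ordered = sorted(files, key=lambda f: f[1], reverse=True)

    partitions: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in ordered:
        if current and current_size + size > max_split_bytes:
            partitions.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size + open_cost_bytes
    if current:
        partitions.append(current)
    return partitions


# Files listed in the order they were written.
files = [("part-00000", 10), ("part-00001", 60),
         ("part-00002", 20), ("part-00003", 50)]

partitions = pack_files(files, max_split_bytes=100, open_cost_bytes=4)
read_order = [name for part in partitions for name in part]

print(partitions)
print(read_order)
```

Under these assumptions the four files end up in two partitions, and the order in which they are read no longer matches the order in which they were written. A workflow that needs a particular row order would therefore have to carry an explicit ordering column and sort on it after reading, rather than rely on file order.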


