Posted to issues@spark.apache.org by "Li Jin (JIRA)" <ji...@apache.org> on 2017/03/31 14:15:41 UTC

[jira] [Comment Edited] (SPARK-20144) spark.read.parquet no long maintains ordering of the data

    [ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15950979#comment-15950979 ] 

Li Jin edited comment on SPARK-20144 at 3/31/17 2:14 PM:
---------------------------------------------------------

Thanks for getting back to me.

Sorting in this case would just add extra cost to our workflow, which is what we were trying to avoid in the first place.

Because a DataFrame presents data in a tabular format, it is very surprising that the ordering of rows in the table changes after a round trip through HDFS. In every other tabular format that I know of, the ordering of rows is a property of the data, so it is surprising that writing and then reading changes that property. This is also a bit scary: if ordering is not a property of a DataFrame, could operations like cache or select("col") change the ordering of rows in the future?



was (Author: icexelloss):
Thanks for getting back to me.

Sorting in this case would just add extra cost to our workflow, which is what we were trying to avoid in the first place.

Because a DataFrame presents data in a tabular format, it is very surprising that the table changes after a round trip through HDFS. In every other tabular format that I know of, the ordering of rows is a property of the data, so it is surprising that writing and then reading changes that property. This is also a bit scary: if ordering is not a property of a DataFrame, could operations like cache or select("col") change the ordering of rows in the future?


> spark.read.parquet no long maintains ordering of the data
> ---------------------------------------------------------
>
>                 Key: SPARK-20144
>                 URL: https://issues.apache.org/jira/browse/SPARK-20144
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: Li Jin
>
> Hi, we are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is that when we read parquet files in 2.0.2, the ordering of rows in the resulting DataFrame is not the same as the ordering of rows in the DataFrame that the parquet files were written from.
> This is because FileSourceStrategy.scala combines the parquet files into fewer partitions and also reorders them. This breaks our workflows because they assume the ordering of the data is preserved.
> Is this considered a bug? Also, FileSourceStrategy and FileSourceScanExec changed quite a bit from 2.0.2 to 2.1, so I am not sure whether this is also an issue in 2.1.
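
To make the reported behavior concrete, here is a rough sketch in plain Python (no Spark required) of how packing files into partitions by size can both reduce the partition count and change the read order. It is a simplified model under stated assumptions, not Spark's actual code: the function pack_files, its parameters max_split_bytes and open_cost, and the file names and sizes are all hypothetical, and the largest-first greedy packing is only an approximation of what the description above attributes to FileSourceStrategy.

```python
# Simplified, hypothetical model of size-based file packing.
# This is NOT Spark's implementation; names and sizes are made up.
from typing import List, Tuple

File = Tuple[str, int]  # (file name, size in bytes)


def pack_files(files: List[File], max_split_bytes: int,
               open_cost: int = 0) -> List[List[str]]:
    """Greedily pack files into partitions, largest file first."""
    partitions: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    # Sorting by size (descending) discards the original file order.
    for name, size in sorted(files, key=lambda f: f[1], reverse=True):
        # Close the current partition if this file would overflow it.
        if current and current_size + size > max_split_bytes:
            partitions.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size + open_cost
    if current:
        partitions.append(current)
    return partitions


# Files listed in the order they were written.
files = [("part-00000", 10), ("part-00001", 40),
         ("part-00002", 25), ("part-00003", 30)]

packed = pack_files(files, max_split_bytes=64)
read_order = [name for part in packed for name in part]

print(packed)      # 3 partitions built from 4 files
print(read_order)  # no longer matches the written order
```

In this toy run the four files end up in three partitions and are read back in a different order than they were written, which is the kind of effect the report describes. If ordering matters, the only approach this sketch suggests is to carry an explicit ordering column and sort on it after reading, which is the extra cost the comment above is trying to avoid.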


