
Posted to issues@spark.apache.org by "David Greenberg (JIRA)" <ji...@apache.org> on 2019/04/11 20:14:00 UTC

[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data

    [ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16815755#comment-16815755 ] 

David Greenberg commented on SPARK-20144:
-----------------------------------------

Hello, this issue is also a major one for me. Almost all of the data I work with has a natural sort order, and I store it in CSV, Parquet, and ORC. Unfortunately, some of my datasets are very large, so I waste a lot of compute time loading those datasets out of storage because Spark throws out serialization information at load and store time.

I would really like to see a solution to this problem, as it's fairly expensive to our bottom line when using Spark.

> spark.read.parquet no long maintains ordering of the data
> ---------------------------------------------------------
>
>                 Key: SPARK-20144
>                 URL: https://issues.apache.org/jira/browse/SPARK-20144
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: Li Jin
>            Priority: Major
>
> Hi, we are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is that when we read parquet files in 2.0.2, the ordering of rows in the resulting dataframe is not the same as the ordering of rows in the dataframe that the parquet file was produced from.
> This is because FileSourceStrategy.scala combines the parquet files into fewer partitions and also reorders them. This breaks our workflows because they assume the ordering of the data is preserved.
> Is this considered a bug? Also, FileSourceStrategy and FileSourceScanExec changed quite a bit from 2.0.2 to 2.1, so we are not sure whether this is also an issue in 2.1.
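
The reordering described in the issue can be illustrated with a small model. The sketch below is plain Python, not Spark code. It assumes that the reader sorts files largest-first and then packs them greedily into partitions up to a byte budget. The function name pack_files, the file names, and the sizes are all hypothetical and chosen only for illustration.

```python
# Simplified model of size-based file packing; NOT Spark's actual code.
# Assumption: files are sorted largest-first, then packed greedily into
# partitions up to a byte budget. All names and sizes are hypothetical.

def pack_files(files, max_partition_bytes):
    """Pack (name, size) pairs into partitions, largest files first."""
    by_size = sorted(files, key=lambda f: f[1], reverse=True)
    partitions, current, current_size = [], [], 0
    for name, size in by_size:
        # Close the current partition when the next file would overflow it.
        if current and current_size + size > max_partition_bytes:
            partitions.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        partitions.append(current)
    return partitions

# Files listed in the order they were written (part-0 holds the first rows).
written = [("part-0", 40), ("part-1", 100), ("part-2", 10), ("part-3", 70)]
partitions = pack_files(written, max_partition_bytes=120)

# Order in which files are encountered when partitions are read in sequence.
read_order = [name for part in partitions for name in part]
print(partitions)
print(read_order)
```

Under these assumptions, four files collapse into two partitions and are no longer visited in write order, which is the effect the reporter describes. A workflow that depends on row order would need to store an explicit ordering column and sort on it after reading.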



