Posted to issues@spark.apache.org by "Bill (JIRA)" <ji...@apache.org> on 2017/05/04 10:32:04 UTC

[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data

    [ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15996517#comment-15996517 ] 

Bill commented on SPARK-20144:
------------------------------

Increasing {{spark.sql.files.openCostInBytes}} prevents the individual Parquet files from being combined into the same partition, but it does not prevent them from being reordered:
[DataSourceScanExec - spark 2.1.0|https://github.com/apache/spark/blob/v2.1.0/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L499]
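
The effect can be illustrated with a simplified model of the packing loop in the linked code. This is a sketch in plain Python, not Spark's implementation: the function name {{plan_partitions}}, the file names and the sizes are hypothetical, splitting of large files is ignored, and it assumes (as the linked code appears to do) that files are sorted by length, largest first, before being packed.

```python
MB = 1024 * 1024

def plan_partitions(files, max_split_bytes, open_cost_in_bytes):
    """Pack (name, length) pairs into partitions, largest file first.

    Simplified model only: every file is treated as a single unsplittable chunk.
    """
    # Assumed behaviour: sort by length descending, discarding the listing order.
    ordered = sorted(files, key=lambda f: f[1], reverse=True)
    partitions, current, current_size = [], [], 0
    for name, length in ordered:
        # Close the current partition if this file would overflow it.
        if current and current_size + length > max_split_bytes:
            partitions.append(current)
            current, current_size = [], 0
        # Each file is charged its length plus the per-file open cost.
        current_size += length + open_cost_in_bytes
        current.append(name)
    if current:
        partitions.append(current)
    return partitions

# Hypothetical files, given in the order they were written.
files = [("part-00000", 10 * MB), ("part-00001", 30 * MB), ("part-00002", 20 * MB)]

# Small open cost: all files are combined into one partition, and reordered.
combined = plan_partitions(files, 128 * MB, 4 * MB)

# Large open cost: one file per partition, but still ordered by size.
separate = plan_partitions(files, 128 * MB, 128 * MB)
```

In this model a large open cost yields one partition per file, yet the partitions come out as part-00001, part-00002, part-00000 rather than in the original file order.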

It seems somewhat inconsistent to respect the within-file ordering of a file-based source (Parquet in this case) but not the between-file ordering, even though the order of the files can also convey information.

Does it make sense to include an option that tells Spark to also respect the file order of a file-based source when constructing the partition list? Multiple small files could still be combined into the same partition for efficiency, but this would let the user tell Spark that the order of the files matters and should be kept (within and between partitions).
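
A minimal sketch of what such an option could mean, again as a plain Python model rather than Spark code. No such option exists in the versions discussed here; the function name {{plan_partitions_in_order}} and the sample files are hypothetical. The only difference from size-sorted packing is that files are packed in the order they are listed.

```python
MB = 1024 * 1024

def plan_partitions_in_order(files, max_split_bytes, open_cost_in_bytes):
    """Pack (name, length) pairs into partitions without reordering them."""
    partitions, current, current_size = [], [], 0
    for name, length in files:  # no sort: keep the caller's file order
        # Close the current partition if this file would overflow it.
        if current and current_size + length > max_split_bytes:
            partitions.append(current)
            current, current_size = [], 0
        # Each file is charged its length plus the per-file open cost.
        current_size += length + open_cost_in_bytes
        current.append(name)
    if current:
        partitions.append(current)
    return partitions

# Hypothetical files, given in the order they were written.
files = [("part-00000", 10 * MB), ("part-00001", 30 * MB), ("part-00002", 20 * MB)]

ordered = plan_partitions_in_order(files, 64 * MB, 4 * MB)
# ordered == [["part-00000", "part-00001"], ["part-00002"]]
```

Small files are still combined, and reading the partitions in sequence visits the files in their original order. The trade-off is that partitions may be less evenly filled than with largest-first packing.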

> spark.read.parquet no long maintains ordering of the data
> ---------------------------------------------------------
>
>                 Key: SPARK-20144
>                 URL: https://issues.apache.org/jira/browse/SPARK-20144
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: Li Jin
>
> Hi, we are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is that when we read parquet files in 2.0.2, the ordering of rows in the resulting dataframe is not the same as the ordering of rows in the dataframe that the parquet files were produced from.
> This is because FileSourceStrategy.scala combines the parquet files into fewer partitions and also reorders them. This breaks our workflows because they assume the ordering of the data.
> Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with 2.1.


