Posted to issues@spark.apache.org by "Li Jin (JIRA)" <ji...@apache.org> on 2017/03/29 14:20:41 UTC

[jira] [Created] (SPARK-20144) spark.read.parquet no longer maintains the ordering of the data

Li Jin created SPARK-20144:
------------------------------

             Summary: spark.read.parquet no longer maintains the ordering of the data
                 Key: SPARK-20144
                 URL: https://issues.apache.org/jira/browse/SPARK-20144
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.2
            Reporter: Li Jin


Hi, we are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is that when we read parquet files in 2.0.2, the ordering of rows in the resulting dataframe is not the same as the ordering of rows in the dataframe that the parquet files were produced from. This is because FileSourceStrategy.scala combines the parquet files into fewer partitions and also reorders them. This breaks our workflows because they assume the ordering of the data. Is this considered a bug? Also, FileSourceStrategy and FileSourceScanExec changed quite a bit from 2.0.2 to 2.1, so we are not sure whether this is also an issue in 2.1.
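
To illustrate the kind of reordering described above, here is a minimal sketch in plain Python. It is a simplified model and not Spark's actual code: it assumes files are sorted by size (largest first) and then greedily packed into partitions up to a byte limit. The function name pack_files, the file names, the sizes, and the limit are all hypothetical and chosen only for illustration.

```python
def pack_files(files, max_partition_bytes):
    """Greedily pack (name, size) pairs into partitions, largest file first.

    Simplified model for illustration only; this is not Spark's code.
    """
    # Sorting by size (descending) discards the order the files were written in.
    ordered = sorted(files, key=lambda f: f[1], reverse=True)

    partitions = []
    current, current_size = [], 0
    for name, size in ordered:
        # Close the current partition when the next file would not fit.
        if current and current_size + size > max_partition_bytes:
            partitions.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        partitions.append(current)
    return partitions


# Hypothetical files, listed in the order they were written.
files = [("part-00000", 10), ("part-00001", 40), ("part-00002", 25), ("part-00003", 5)]
partitions = pack_files(files, max_partition_bytes=50)

# Order in which rows would be seen if partitions are read one after another.
read_order = [name for part in partitions for name in part]
print(partitions)
print(read_order)
```

In this model the four input files end up in two partitions, and part-00001 is read before part-00000, so any code that relies on the original write order sees the rows in a different sequence.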



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)