Posted to issues@spark.apache.org by "Max Moroz (JIRA)" <ji...@apache.org> on 2016/06/25 10:38:37 UTC

[jira] [Commented] (SPARK-16207) order guarantees for DataFrames

    [ https://issues.apache.org/jira/browse/SPARK-16207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15349594#comment-15349594 ] 

Max Moroz commented on SPARK-16207:
-----------------------------------

Would something like this be useful in the docs? If so, I'll clean it up, verify it, and open a PR.

A DataFrame returned by the orderBy() transformation has the order determined by its arguments.
A GroupedData returned by the groupBy() transformation of an ordered DataFrame has the order of its parent within each group.
A DataFrame returned by any transformation of an ordered DataFrame that does not involve grouping or sorting (such as select() or withColumn()) has the order of its parent DataFrame.
Any DataFrame or GroupedData for which the above rules do not define an order is unordered.

The actions take(), first(), and collect() return results in an order consistent with the DataFrame or group order, if any, or in an arbitrary order if the DataFrame or group is unordered.

Example: a DataFrame created by reading a text file is unordered.
Example: a DataFrame created by df.select(explode('column')) is unordered.
Example: a DataFrame created by df.orderBy('date').groupBy('id').agg(F.first('date')) is unordered, but the aggregation will choose the lowest date for each id.
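
To make the proposed rules easier to check against the examples, here is a minimal sketch in plain Python that only tracks which order each rule would promise. It is a toy model, not Spark: the Frame and Grouped classes, their method names, and read_text() are hypothetical stand-ins invented for illustration, and the sketch says nothing about what Spark actually guarantees.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Toy model only: these hypothetical classes record the order that the
# proposed rules would promise. They do not call or imitate Spark itself.


@dataclass(frozen=True)
class Grouped:
    """Stand-in for GroupedData."""
    keys: Tuple[str, ...]
    order_within_group: Optional[Tuple[str, ...]]

    def agg(self, *exprs: str) -> "Frame":
        # Rule 4: no rule assigns an order to an aggregation result.
        return Frame(order=None)


@dataclass(frozen=True)
class Frame:
    """Stand-in for a DataFrame; `order` is the sort key, or None if unordered."""
    order: Optional[Tuple[str, ...]] = None

    def order_by(self, *cols: str) -> "Frame":
        # Rule 1: orderBy() defines the order from its arguments.
        return Frame(order=cols)

    def select(self, *cols: str) -> "Frame":
        # Rule 3: no grouping or sorting, so the parent's order is kept.
        return Frame(order=self.order)

    def group_by(self, *keys: str) -> Grouped:
        # Rule 2: each group inherits the parent's order.
        return Grouped(keys=keys, order_within_group=self.order)


def read_text() -> Frame:
    # First example: a frame read from a text file starts out unordered.
    return Frame(order=None)


df = read_text()
ordered = df.order_by("date")
grouped = ordered.group_by("id")
result = grouped.agg("first(date)")

print(df.order)                             # None
print(ordered.select("id", "date").order)   # ('date',)
print(grouped.order_within_group)           # ('date',)
print(result.order)                         # None
```

In this model the result of the aggregation is unordered even though each group was ordered by date when first('date') was evaluated, which is the distinction the third example is drawing.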

> order guarantees for DataFrames
> -------------------------------
>
>                 Key: SPARK-16207
>                 URL: https://issues.apache.org/jira/browse/SPARK-16207
>             Project: Spark
>          Issue Type: Documentation
>          Components: Spark Core
>    Affects Versions: 1.6.1
>            Reporter: Max Moroz
>            Priority: Minor
>
> There's no clear explanation in the documentation about what guarantees are available for the preservation of order in DataFrames. Different blogs, SO answers, and posts on course websites suggest different things. It would be good to provide clarity on this.
> Examples of questions on which I could not find clarification:
> 1) Does groupby() preserve order?
> 2) Does take() preserve order?
> 3) Is DataFrame guaranteed to have the same order of lines as the text file it was read from? (Or as the json file, etc.)


