Posted to issues@spark.apache.org by "Chris Rogers (JIRA)" <ji...@apache.org> on 2017/03/07 21:53:38 UTC

[jira] [Comment Edited] (SPARK-16207) order guarantees for DataFrames

    [ https://issues.apache.org/jira/browse/SPARK-16207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900220#comment-15900220 ] 

Chris Rogers edited comment on SPARK-16207 at 3/7/17 9:52 PM:
--------------------------------------------------------------

[~srowen] Since there is no documentation yet, I don't know whether a clear, coherent generalization can be made. I would be happy with either "most of the methods DO NOT preserve order, with these specific exceptions" or "most of the methods DO preserve order, with these specific exceptions".

Failing a generalization, I'd also be happy with method-by-method documentation of ordering semantics, which should need only a small amount of copy-pasting ("Preserves ordering: yes", "Preserves ordering: no"). Maybe that's a good place to start, since there seems to be some confusion about what the generalization would be.

I'm new to Scala, so I'm not sure whether this is practical, but maybe the appropriate methods could be moved to an `RDDPreservesOrdering` class with an implicit conversion, akin to `PairRDDFunctions`?
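
To make the idea concrete, here is a rough sketch of the pattern. It uses a toy stand-in (`ToyRDD`, a hypothetical name) rather than Spark's real classes, and which methods actually preserve order in Spark is exactly the open question here, so the choice of `map` and `filter` below is an assumption for illustration only.

```scala
import scala.language.implicitConversions

// Toy stand-in for an RDD. This is NOT Spark's class; it holds a local
// Vector just so the pattern can be shown without a cluster.
class ToyRDD[T](val elems: Vector[T]) {
  // An operation that makes no promise about element order.
  def distinct(): ToyRDD[T] = new ToyRDD(elems.toSet.toVector)
}

// Hypothetical wrapper that gathers the order-preserving methods in one
// place, so the class itself documents the guarantee.
class RDDPreservesOrdering[T](self: ToyRDD[T]) {
  def map[U](f: T => U): ToyRDD[U] = new ToyRDD(self.elems.map(f))
  def filter(p: T => Boolean): ToyRDD[T] = new ToyRDD(self.elems.filter(p))
}

object ToyRDD {
  // Placing the conversion in the companion object puts it in implicit
  // scope, so callers need no extra import.
  implicit def toOrderPreserving[T](rdd: ToyRDD[T]): RDDPreservesOrdering[T] =
    new RDDPreservesOrdering(rdd)
}

object Demo {
  def main(args: Array[String]): Unit = {
    val rdd = new ToyRDD(Vector(3, 1, 2))
    // ToyRDD has no map or filter of its own, so the compiler applies
    // the implicit conversion for each call.
    val out = rdd.map(_ * 10).filter(_ > 10)
    println(out.elems) // prints Vector(30, 20)
  }
}
```

With this layout, the Scaladoc page for the wrapper class would double as the list of methods that keep ordering, and everything left on the base class would be understood to make no such promise.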



> order guarantees for DataFrames
> -------------------------------
>
>                 Key: SPARK-16207
>                 URL: https://issues.apache.org/jira/browse/SPARK-16207
>             Project: Spark
>          Issue Type: Documentation
>          Components: Spark Core
>    Affects Versions: 1.6.1
>            Reporter: Max Moroz
>            Priority: Minor
>
> There's no clear explanation in the documentation about what guarantees are available for the preservation of order in DataFrames. Different blogs, SO answers, and posts on course websites suggest different things. It would be good to provide clarity on this.
> Examples of questions on which I could not find clarification:
> 1) Does groupby() preserve order?
> 2) Does take() preserve order?
> 3) Is DataFrame guaranteed to have the same order of lines as the text file it was read from? (Or as the json file, etc.)



