Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:00:32 UTC

[jira] [Updated] (SPARK-15798) Secondary sort in Dataset/DataFrame

     [ https://issues.apache.org/jira/browse/SPARK-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-15798:
---------------------------------
    Labels: bulk-closed  (was: )

> Secondary sort in Dataset/DataFrame
> -----------------------------------
>
>                 Key: SPARK-15798
>                 URL: https://issues.apache.org/jira/browse/SPARK-15798
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: koert kuipers
>            Priority: Major
>              Labels: bulk-closed
>
> Secondary sort for Spark RDDs was discussed in https://issues.apache.org/jira/browse/SPARK-3655
> Since the RDD API allows for easy extensions outside the core library, this was implemented separately here:
> https://github.com/tresata/spark-sorted
> However, with Dataset it seems to me that implementing such a feature in a third-party library is not really an option.
> Dataset already has methods whose signatures suggest a secondary sort, such as this one in KeyValueGroupedDataset:
> {noformat}
> def flatMapGroups[U : Encoder](f: (K, Iterator[V]) => TraversableOnce[U]): Dataset[U]
> {noformat}
> This operation pushes all the data to the reducer, something you would only want to do if you need the elements in a particular order.
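> A minimal sketch (not from this ticket) of what that looks like today: since the iterator carries no ordering guarantee, ordering the values means buffering each group in memory. The Event class and its fields are hypothetical names for illustration.
> {noformat}
> import org.apache.spark.sql.SparkSession
>
> case class Event(key: String, ts: Long, payload: String)
>
> val spark = SparkSession.builder().master("local[*]").getOrCreate()
> import spark.implicits._
>
> val events = Seq(
>   Event("a", 3L, "x"), Event("a", 1L, "y"), Event("b", 2L, "z")
> ).toDS()
>
> // Today: buffer and sort each group in memory, which defeats the
> // point of a streaming secondary sort and can OOM on skewed keys.
> val ordered = events
>   .groupByKey(_.key)
>   .flatMapGroups { (key, it) =>
>     it.toSeq.sortBy(_.ts).map(e => (key, e.payload))
>   }
> {noformat}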
> How about adding sortBy methods to KeyValueGroupedDataset and RelationalGroupedDataset as the API?
> {noformat}
> dataFrame.groupBy("a").sortBy("b").fold(...)
> {noformat}
> (yes, I know RelationalGroupedDataset doesn't have a fold yet... but it should :))
> {noformat}
> dataset.groupBy(_._1).sortBy(_._3).flatMapGroups(...)
> {noformat}
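> Until such an API exists, here is a hedged sketch of the usual workaround: repartition by the grouping column and sort within partitions on (group, secondary), so per-group logic can stream over each partition without buffering. Column names "a" and "b" follow the examples above; the first-per-key fold is just an illustration.
> {noformat}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions.col
>
> val spark = SparkSession.builder().master("local[*]").getOrCreate()
> import spark.implicits._
>
> val df = Seq(("a", 3), ("a", 1), ("b", 2), ("a", 2)).toDF("a", "b")
>
> // All rows for a key land in one partition, sorted by the secondary
> // column, so downstream per-group logic never needs to buffer a group.
> val sorted = df
>   .repartition(col("a"))
>   .sortWithinPartitions(col("a"), col("b"))
>
> // Example fold: emit the first "b" per key (the minimum, given the sort).
> val firstPerKey = sorted.as[(String, Int)].mapPartitions { it =>
>   var current: Option[String] = None
>   it.flatMap { case (a, b) =>
>     if (current.contains(a)) Iterator.empty
>     else { current = Some(a); Iterator((a, b)) }
>   }
> }
> {noformat}
> This is only an approximation: it relies on partition-local ordering rather than a first-class groupBy("a").sortBy("b") API, which is what this ticket asks for.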



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org