Posted to issues@spark.apache.org by "koert kuipers (JIRA)" <ji...@apache.org> on 2016/06/07 05:04:21 UTC

[jira] [Created] (SPARK-15798) Secondary sort in Dataset/DataFrame

koert kuipers created SPARK-15798:
-------------------------------------

             Summary: Secondary sort in Dataset/DataFrame
                 Key: SPARK-15798
                 URL: https://issues.apache.org/jira/browse/SPARK-15798
             Project: Spark
          Issue Type: New Feature
          Components: SQL
            Reporter: koert kuipers


Secondary sort for Spark RDDs was discussed in https://issues.apache.org/jira/browse/SPARK-3655
Since the RDD API allows for easy extension outside the core library, this was implemented separately here:
https://github.com/tresata/spark-sorted

However, it seems to me that with Dataset, implementing such a feature in a third-party library is not really an option.

Dataset already has methods that suggest secondary sort is available, such as this one in KeyValueGroupedDataset:
{noformat}
def flatMapGroups[U : Encoder](f: (K, Iterator[V]) => TraversableOnce[U]): Dataset[U]
{noformat}
This operation pushes all the data to the reducer, something you would only want to do if you need the elements in a particular order.

How about adding sortBy methods to KeyValueGroupedDataset and RelationalGroupedDataset as the API?
{noformat}
dataFrame.groupBy("a").sortBy("b").fold(...)
{noformat}
(Yes, I know RelationalGroupedDataset doesn't have a fold yet... but it should :))
{noformat}
dataset.groupByKey(_._1).sortBy(_._3).flatMapGroups(...)
{noformat}
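
To make the intended semantics concrete, here is a rough sketch on plain Scala collections (no Spark involved). It models what the proposed sortBy followed by flatMapGroups would hand to the user function: one call per key, with an iterator over that key's rows ordered by the secondary field. The names SecondarySortSketch, Row, groupSortFlatMap and concatPayloads are made up for illustration, and the sortBy on grouped Datasets remains a proposal, not an existing Spark API.

```scala
object SecondarySortSketch {
  // Illustrative row shape: (_1 = grouping key, _2 = payload, _3 = secondary sort field).
  type Row = (String, String, Int)

  // An order-dependent per-group function: joins the payloads in the
  // order in which the iterator yields them.
  val concatPayloads: (String, Iterator[Row]) => Seq[(String, String)] =
    (key, values) => Seq(key -> values.map(_._2).mkString(">"))

  // Local model of the proposal: group by _1, sort each group by _3,
  // then call f once per key with the sorted iterator.
  // Groups are emitted in key order only to keep the output deterministic.
  def groupSortFlatMap[U](rows: Seq[Row])(f: (String, Iterator[Row]) => Seq[U]): Seq[U] =
    rows
      .groupBy(_._1)
      .toSeq
      .sortBy(_._1)
      .flatMap { case (key, group) => f(key, group.sortBy(_._3).iterator) }

  def main(args: Array[String]): Unit = {
    val rows = Seq(("u1", "c", 3), ("u2", "x", 1), ("u1", "a", 1), ("u1", "b", 2))
    groupSortFlatMap(rows)(concatPayloads).foreach(println)
  }
}
```

Without the per-group sort, the result of concatPayloads would depend on whatever order the rows happen to arrive in, which is why an ordering guarantee matters for functions that consume the group's iterator.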



