Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/10/18 08:10:21 UTC

[GitHub] [spark] EnricoMi opened a new pull request, #38296: [SPARK-40830][SQL] Provide groupByKey shortcuts for groupBy.as

EnricoMi opened a new pull request, #38296:
URL: https://github.com/apache/spark/pull/38296

   ### What changes were proposed in this pull request?
   This documents that `groupBy(...).as[...]` should be preferred over `groupByKey(...)`.
   
   It further provides shortcuts for `groupBy(...).as[...]` that make it easier to use column-based `groupByKey`.
   
   ### Why are the changes needed?
   
   Calling `Dataset.groupBy(...).as[K, T]` should be preferred over calling `Dataset.groupByKey(...)` whenever possible. The former allows Catalyst to exploit existing partitioning and ordering of the Dataset, while the latter hides from Catalyst which columns are used to create the keys.
   
   _When the dataset is already partitioned and ordered by the grouping columns, `Dataset.groupByKey(...)` will repartition and order the entire dataset again._
   
   Example:
   
   Calling `ds.groupByKey(_.id)` hides from Catalyst that column `id` is the grouping key, while `ds.groupBy($"id").as[Int, V]` tells Catalyst that `ds` is to be grouped by (partitioned and ordered by) column `id`.
   
   The new column-based `groupByKey` methods make it easier for users to discover how to express grouping by expressions. Browsing the `Dataset` API, a user naturally finds the `Column`-based `groupByKey`. The existing route, `groupBy`, returns a `RelationalGroupedDataset` whose `as[K, V]` method provides the same semantics but is hard to discover.
   
   The new column-based `groupByKey` methods also spare the user from restating the type `V` of the original `Dataset[V]`, since `groupByKey` already has access to that type's encoder:
   
       ds.groupBy($"id").as[Int, V]
   
   vs.
   
       ds.groupByKey[Int]($"id")
   
   ### Does this PR introduce _any_ user-facing change?
   
   **Note: This breaks compilation where a bare function name is passed to the existing `groupByKey`, because the new overloads make the call ambiguous:**
   
   Was:
   
       groupByKey(identity)
   
   Now:
   
       groupByKey(identity(_))
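   The compile break above comes from overload resolution: with a single `groupByKey`, the compiler can eta-expand the polymorphic `identity` against the one expected function type, but once column-based overloads exist there is no single expected type to drive that expansion. A minimal, non-Spark sketch of the effect, where `MiniDataset` and `Col` are hypothetical stand-ins for `Dataset` and `Column`:

```scala
// Hypothetical stand-ins for Spark's Dataset and Column, just to
// reproduce the overload shape introduced by this PR.
final case class Col(name: String)

final class MiniDataset[V](data: Seq[V]) {
  // existing function-based variant
  def groupByKey[K](f: V => K): Map[K, Seq[V]] = data.groupBy(f)
  // new column-based variant (stand-in implementation)
  def groupByKey(col: Col, cols: Col*): String =
    "grouped by " + (col +: cols).map(_.name).mkString(", ")
}

object EtaExpansionDemo {
  def main(args: Array[String]): Unit = {
    val ds = new MiniDataset(Seq(1, 2, 2, 3))
    // ds.groupByKey(identity)   // no longer compiles: the bare method name
    //                           // cannot be eta-expanded against two overloads
    val byValue = ds.groupByKey(identity(_)) // explicit expansion disambiguates
    println(byValue(2)) // the group for key 2, i.e. List(2, 2)
  }
}
```

   Writing `identity(_)` produces a function literal, whose arity-1 shape matches only the function-based overload, so resolution succeeds again.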
   
   It adds these methods:
   
   Scala:
   - `Dataset.groupByKey[K](Column*)`
   - `Dataset.groupByKey[K](String, String*)`
   
   Java:
   - `Dataset.groupByKey[K](Encoder[K], Column...)`
   - `Dataset.groupByKey[K](Encoder[K], String, String...)`
   
   
   ### How was this patch tested?
   Adds tests to `DatasetSuite` and `JavaDatasetSuite`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #38296: [SPARK-40830][SQL] Provide groupByKey shortcuts for groupBy.as

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #38296:
URL: https://github.com/apache/spark/pull/38296#issuecomment-1282039598

   Can one of the admins verify this patch?




[GitHub] [spark] github-actions[bot] commented on pull request #38296: [SPARK-40830][SQL] Provide groupByKey shortcuts for groupBy.as

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #38296:
URL: https://github.com/apache/spark/pull/38296#issuecomment-1613947428

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!




[GitHub] [spark] github-actions[bot] closed pull request #38296: [SPARK-40830][SQL] Provide groupByKey shortcuts for groupBy.as

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #38296: [SPARK-40830][SQL] Provide groupByKey shortcuts for groupBy.as
URL: https://github.com/apache/spark/pull/38296

