Posted to issues@spark.apache.org by "Jungtaek Lim (Jira)" <ji...@apache.org> on 2022/02/17 09:05:00 UTC

[jira] [Updated] (SPARK-38237) Rename back StatefulOpClusteredDistribution to HashClusteredDistribution

     [ https://issues.apache.org/jira/browse/SPARK-38237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim updated SPARK-38237:
---------------------------------
    Description: 
We still find HashClusteredDistribution useful for batch queries as well. For example, we had a case with lower parallelism than expected because ClusteredDistribution is used for aggregation, and it is satisfied by a HashPartitioning on a subset of the grouping keys (note that the effective parallelism also depends on "cardinality": picking a subset of the keys means lower cardinality).

We propose renaming it back to HashClusteredDistribution while retaining the NOTE for stateful operators. The distribution still should not be modified, due to the requirements of stateful operators, but it can also be used for batch cases if needed.

  was:
We still find HashClusteredDistribution useful for batch queries as well. For example, we had a case with lower parallelism than expected because ClusteredDistribution is used for aggregation, and it is satisfied by a HashPartitioning on a subset of the grouping keys (where the parallelism also depends on cardinality).

We propose renaming it back to HashClusteredDistribution while retaining the NOTE for stateful operators. The distribution still should not be modified, due to the requirements of stateful operators, but it can also be used for batch cases if needed.


> Rename back StatefulOpClusteredDistribution to HashClusteredDistribution
> ------------------------------------------------------------------------
>
>                 Key: SPARK-38237
>                 URL: https://issues.apache.org/jira/browse/SPARK-38237
>             Project: Spark
>          Issue Type: Task
>          Components: SQL, Structured Streaming
>    Affects Versions: 3.3.0
>            Reporter: Jungtaek Lim
>            Priority: Major
>
> We still find HashClusteredDistribution useful for batch queries as well. For example, we had a case with lower parallelism than expected because ClusteredDistribution is used for aggregation, and it is satisfied by a HashPartitioning on a subset of the grouping keys (note that the effective parallelism also depends on "cardinality": picking a subset of the keys means lower cardinality).
> We propose renaming it back to HashClusteredDistribution while retaining the NOTE for stateful operators. The distribution still should not be modified, due to the requirements of stateful operators, but it can also be used for batch cases if needed.
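
To illustrate the parallelism concern described above, here is a toy model in plain Python. It is not Spark code: the function names (satisfies_clustered, satisfies_hash_clustered, parallelism_bound) and the sample data are hypothetical, and the two rules are simplified to the point relevant here, namely that a ClusteredDistribution-style requirement accepts hash partitioning on a subset of the grouping keys, while a HashClusteredDistribution-style requirement accepts only the exact key list.

```python
from typing import List, Sequence, Tuple


def satisfies_clustered(partition_keys: Sequence[str], required_keys: Sequence[str]) -> bool:
    """Toy ClusteredDistribution rule: every hash key must be one of the
    required clustering keys, so a subset of the keys is accepted."""
    return len(partition_keys) > 0 and all(k in required_keys for k in partition_keys)


def satisfies_hash_clustered(partition_keys: Sequence[str], required_keys: Sequence[str]) -> bool:
    """Toy HashClusteredDistribution rule: the hash keys must match the
    required keys exactly, in the same order."""
    return list(partition_keys) == list(required_keys)


def parallelism_bound(rows: List[Tuple[int, ...]], key_indexes: Sequence[int], num_partitions: int) -> int:
    """Upper bound on non-empty partitions: it cannot exceed the number of
    distinct key values (the cardinality) or the partition count."""
    distinct_keys = {tuple(row[i] for i in key_indexes) for row in rows}
    return min(num_partitions, len(distinct_keys))


# Hypothetical data: column "a" has 2 distinct values, column "b" has 100.
rows = [(a, b) for a in range(2) for b in range(100)]
grouping_keys = ["a", "b"]

# Data already hash-partitioned on "a" alone passes the clustered rule,
# so no reshuffle on (a, b) would be requested in this toy model.
print(satisfies_clustered(["a"], grouping_keys))
print(satisfies_hash_clustered(["a"], grouping_keys))

# With 8 partitions, partitioning on "a" can keep at most 2 of them busy,
# while partitioning on the full key can use all 8.
print(parallelism_bound(rows, [0], 8))
print(parallelism_bound(rows, [0, 1], 8))
```

In this sketch the aggregation on (a, b) would reuse the existing partitioning on "a" under the subset rule and be limited by the cardinality of "a", which is the kind of lower-than-expected parallelism the description refers to.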


