Posted to issues@spark.apache.org by "Jungtaek Lim (Jira)" <ji...@apache.org> on 2019/12/03 12:37:00 UTC

[jira] [Commented] (SPARK-30101) spark.sql.shuffle.partitions is not in Configuration docs, but a very critical parameter

    [ https://issues.apache.org/jira/browse/SPARK-30101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16986859#comment-16986859 ] 

Jungtaek Lim commented on SPARK-30101:
--------------------------------------

I'm not aware of how the configuration page is constructed, but since that page points readers to how to retrieve the full list of Spark SQL configurations, these configs are deliberately not enumerated in the global configuration page; Spark SQL simply has too many configurations of its own.

[https://spark.apache.org/docs/latest/configuration.html#spark-sql]

There's already a doc describing `spark.sql.shuffle.partitions`, though it is introduced under "other configuration options". We could address that if we strongly agree on the need to prioritize this config.

[http://spark.apache.org/docs/latest/sql-performance-tuning.html]
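
For reference, a minimal sketch of setting that config (the app name and the value `2` here are illustrative, not recommendations):

```
// Set it when building the session...
val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("shuffle-partitions-demo")
  .master("local")
  .config("spark.sql.shuffle.partitions", "2")
  .getOrCreate()

// ...or change it at runtime; subsequent queries pick up the new value.
spark.conf.set("spark.sql.shuffle.partitions", "2")
```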

Btw, `coalesce` may cover the need for an optional `numPartitions` parameter, but I guess some users may not find that convenient.
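
A sketch of that workaround (assuming an existing SparkSession named `spark`):

```
import spark.implicits._

// distinct() shuffles into spark.sql.shuffle.partitions (200 by default);
// coalesce() then narrows the result without another full shuffle.
val deduped = ((1 to 10) ++ (1 to 10)).toDS().distinct().coalesce(2)
println(deduped.rdd.getNumPartitions)  // 2
```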

> spark.sql.shuffle.partitions is not in Configuration docs, but a very critical parameter
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-30101
>                 URL: https://issues.apache.org/jira/browse/SPARK-30101
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0, 2.4.4
>            Reporter: sam
>            Priority: Major
>
> I'm creating a `SparkSession` like this:
> ```
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession
>   .builder().appName("foo").master("local")
>   .config("spark.default.parallelism", 2).getOrCreate()
> ```
> When I run
> ```
> import spark.implicits._
>
> ((1 to 10) ++ (1 to 10)).toDS().distinct().count()
> ```
> I get 200 partitions:
> ```
> 19/12/02 10:29:34 INFO TaskSchedulerImpl: Adding task set 1.0 with 200 tasks
> ...
> 19/12/02 10:29:34 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 46 ms on localhost (executor driver) (1/200)
> ```
> It is the `distinct` that is broken, since `ds.rdd.getNumPartitions` gives `2` while `ds.distinct().rdd.getNumPartitions` gives `200`. `ds.rdd.groupBy(identity).map(_._2.head)` and `ds.rdd.distinct()` work correctly.
> Finally, I notice that the good old `RDD` interface has a `distinct` that accepts a `numPartitions` argument, while `Dataset` does not.
> .......................
> According to the comments below, it uses `spark.sql.shuffle.partitions`, which needs documenting in the configuration page.
> > Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
> This entry in https://spark.apache.org/docs/latest/configuration.html should instead say:
> > Default number of partitions in RDDs, but not in DS/DF (see spark.sql.shuffle.partitions), returned by transformations like join, reduceByKey, and parallelize when not set by user.
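
To illustrate the distinction the proposed wording makes, a sketch assuming the session from the issue above (with `spark.default.parallelism` set to 2 and `spark.sql.shuffle.partitions` left at its default):

```
import spark.implicits._

val data = (1 to 10) ++ (1 to 10)

// RDD distinct honors spark.default.parallelism:
println(spark.sparkContext.parallelize(data).distinct().getNumPartitions)  // 2

// Dataset distinct shuffles into spark.sql.shuffle.partitions instead:
println(data.toDS().distinct().rdd.getNumPartitions)  // 200 by default
```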



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org