Posted to issues@spark.apache.org by "Yin Huai (JIRA)" <ji...@apache.org> on 2016/06/01 04:33:13 UTC

[jira] [Commented] (SPARK-15530) Partitioning discovery logic in HadoopFsRelation should use a higher setting of parallelism

    [ https://issues.apache.org/jira/browse/SPARK-15530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15309240#comment-15309240 ] 

Yin Huai commented on SPARK-15530:
----------------------------------

Your change looks reasonable. How about we just take the value of sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold? We can change the doc of this conf and increase the default value.
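
A minimal sketch of that idea, as it might look inside Spark's own listing code (the paths and sparkSession values are assumed to already be in scope there):

    // Reuse the existing threshold conf as the listing job's parallelism,
    // capped by the number of paths so no empty tasks are created.
    val numParallelism = math.min(
      paths.size,
      sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold)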

> Partitioning discovery logic in HadoopFsRelation should use a higher setting of parallelism
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-15530
>                 URL: https://issues.apache.org/jira/browse/SPARK-15530
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Yin Huai
>
> At https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala#L418, we launch a Spark job to do parallel file listing in order to discover partitions. However, we do not set the number of partitions there, which means that we use the default parallelism of the cluster. It would be better to set the number of partitions explicitly to generate smaller tasks, which helps with load balancing.
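
For illustration, here is a minimal sketch of the direction the ticket describes, written as if inside Spark's listing code; the helper name listLeafFilesInParallel and the cap of 10000 are illustrative assumptions, not taken from the linked source:

    import org.apache.hadoop.fs.Path
    import org.apache.spark.sql.SparkSession

    // Launch the file-listing job with an explicit partition count so that
    // tasks stay small, instead of falling back to the cluster's default
    // parallelism.
    def listLeafFilesInParallel(paths: Seq[Path], spark: SparkSession): Seq[String] = {
      val numParallelism = math.min(paths.size, 10000)  // illustrative cap
      spark.sparkContext
        .parallelize(paths.map(_.toString), numParallelism)
        .flatMap { p =>
          // List files under p with the Hadoop FileSystem API (elided here);
          // for this sketch we just pass the path string through.
          Seq(p)
        }
        .collect()
        .toSeq
    }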



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org