Posted to user@spark.apache.org by "Wang, Jensen" <je...@sap.com> on 2014/07/22 04:18:15 UTC
defaultMinPartitions in textFile
Hi,
I started using Spark on YARN recently and ran into a question while tuning my program.
When SparkContext is initialized as sc and is ready to read a text file from HDFS, the textFile(path, defaultMinPartitions) method is called.
I traced the second parameter down through the Spark source code and finally found this:
conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2)) in CoarseGrainedSchedulerBackend.scala
I do not set the property "spark.default.parallelism" anywhere, so getInt will return the larger of totalCoreCount and 2.
When I submit the application with spark-submit and specify --num-executors 2 --executor-cores 6, I suppose totalCoreCount will be
2 * 6 = 12, so defaultMinPartitions will be 12.
But when I print the value of defaultMinPartitions in my program, I still get 2. How does this happen, or where am I making a mistake?
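For reference, the fallback quoted above can be checked in isolation. This is a hedged sketch with a hypothetical object name, not Spark's actual class; it only reproduces the arithmetic of the line found in CoarseGrainedSchedulerBackend.scala:

```scala
// Standalone sketch (not Spark's actual class) of the fallback used when
// spark.default.parallelism is not set: at least 2, otherwise the core count.
object ParallelismFallback {
  def defaultParallelism(totalCoreCount: Int): Int =
    math.max(totalCoreCount, 2)

  def main(args: Array[String]): Unit = {
    // --num-executors 2 --executor-cores 6  =>  totalCoreCount = 12
    println(defaultParallelism(2 * 6))  // prints 12, as the poster expected
  }
}
```

So the surprise is not in this line: for 12 cores it really does evaluate to 12.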
RE: defaultMinPartitions in textFile
Posted by "Wang, Jensen" <je...@sap.com>.
Yes, great. I thought that line used math.max instead of math.min. Thank you!
From: Ye Xianjin [mailto:advancedxy@gmail.com]
Sent: Tuesday, July 22, 2014 11:37 AM
To: user@spark.apache.org
Subject: Re: defaultMinPartitions in textFile
Well, I think you missed these lines of code in SparkContext.scala,
lines 1242-1243 (master):
/** Default min number of partitions for Hadoop RDDs when not given by user */
def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
So defaultMinPartitions will be 2 unless defaultParallelism is less than 2.
--
Ye Xianjin
Sent with Sparrow<http://www.sparrowmailapp.com/?sig>
Re: defaultMinPartitions in textFile
Posted by Ye Xianjin <ad...@gmail.com>.
Well, I think you missed these lines of code in SparkContext.scala,
lines 1242-1243 (master):
/** Default min number of partitions for Hadoop RDDs when not given by user */
def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
So defaultMinPartitions will be 2 unless defaultParallelism is less than 2.
--
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
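Composing the two lines discussed in this thread makes the observed behavior easy to check. This is a hedged sketch with simplified, hypothetical names, not Spark's actual classes:

```scala
// Standalone sketch composing the two lines from this thread.
object MinPartitionsSketch {
  // Fallback in CoarseGrainedSchedulerBackend.scala when
  // spark.default.parallelism is not set.
  def defaultParallelism(totalCoreCount: Int): Int =
    math.max(totalCoreCount, 2)

  // SparkContext.defaultMinPartitions: note math.min, not math.max.
  def defaultMinPartitions(totalCoreCount: Int): Int =
    math.min(defaultParallelism(totalCoreCount), 2)

  def main(args: Array[String]): Unit = {
    println(defaultMinPartitions(12))  // prints 2: capped at 2 by math.min
    println(defaultMinPartitions(1))   // prints 2: max(1, 2) = 2, then min(2, 2) = 2
  }
}
```

In other words, via this fallback path defaultMinPartitions is always 2, regardless of core count. If more input partitions are wanted, the minPartitions argument can be passed explicitly, e.g. sc.textFile(path, 12); defaultMinPartitions is only the default value of that argument.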