Posted to user@spark.apache.org by jjayadeep <ja...@gmail.com> on 2017/03/23 15:36:25 UTC
[PySpark] - Binary File Partition
Hi,
I am using Spark 1.6.2. Is there a known bug where the number of partitions
is always 2 when minPartitions is not specified, as below?
images = sc.binaryFiles("s3n://<access-key>:<secret-key>@imagefiles-gok/locofiles-data/")
I was looking at the source code for PortableDataStream.scala, which I
believe is used when we invoke the binary files interface, and I see the
code below:
def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
  val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
  val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
  val defaultParallelism = sc.defaultParallelism
  val files = listStatus(context).asScala
  val totalBytes = files.filterNot(_.isDirectory).map(_.getLen + openCostInBytes).sum
  val bytesPerCore = totalBytes / defaultParallelism
  val maxSplitSize = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
  super.setMaxSplitSize(maxSplitSize)
}
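For intuition, the arithmetic in that Scala method can be sketched in plain Python. This is only an illustrative mirror of the calculation, not Spark code: the function name and the file-size list are made up, and the defaults 134217728 and 4194304 are my understanding of spark.files.maxPartitionBytes (128 MB) and spark.files.openCostInBytes (4 MB).

```python
def compute_max_split_size(file_sizes, default_parallelism,
                           max_partition_bytes=134217728,  # assumed spark.files.maxPartitionBytes default
                           open_cost_in_bytes=4194304):    # assumed spark.files.openCostInBytes default
    """Plain-Python sketch of the setMinPartitions logic quoted above.

    Note that the minPartitions argument appears nowhere in the
    calculation: only total input size, open cost, and default
    parallelism drive the resulting split size.
    """
    # Each file's length is padded by the open cost before summing.
    total_bytes = sum(size + open_cost_in_bytes for size in file_sizes)
    bytes_per_core = total_bytes // default_parallelism
    # The split size is clamped between the open cost and the max partition bytes.
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))

# Example: ten 1 MB files with defaultParallelism = 8
split = compute_max_split_size([1 << 20] * 10, 8)
```

Under these assumptions, many small files get packed together into a few large splits, so the partition count can come out low no matter what minPartitions you request; calling repartition(n) on the resulting RDD is one way to force more partitions afterwards.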
Does this mean that minPartitions is no longer used in the partition
determination calculation?
Kindly advise.
Thanks,
Jayadeep