Posted to user@spark.apache.org by jjayadeep <ja...@gmail.com> on 2017/03/23 15:36:25 UTC

[PySpark] - Binary File Partition

Hi,

I am using Spark 1.6.2. Is there a known bug where the number of partitions
is always 2 when minPartitions is not specified, as below?

images = sc.binaryFiles("s3n://<AWS_ACCESS_KEY>:<AWS_SECRET_KEY>@imagefiles-gok/locofiles-data/")

I was looking at the source code of PortableDataStream.scala, which I
believe is used when we invoke the binary files interface, and I see the
code below:

  def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
    val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
    val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
    val defaultParallelism = sc.defaultParallelism
    val files = listStatus(context).asScala
    val totalBytes = files.filterNot(_.isDirectory).map(_.getLen + openCostInBytes).sum
    val bytesPerCore = totalBytes / defaultParallelism
    val maxSplitSize = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
    super.setMaxSplitSize(maxSplitSize)
  }
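For what it's worth, here is a minimal sketch of that arithmetic in plain
Python. The file sizes and core count are hypothetical, and the two defaults
(128 MB and 4 MB) are the usual values of spark.files.maxPartitionBytes and
spark.files.openCostInBytes; note that minPartitions never appears in the
calculation:

```python
# Sketch of the setMinPartitions logic quoted above (not the actual Spark code).
def max_split_size(file_sizes, default_parallelism,
                   default_max_split_bytes=128 * 1024 * 1024,  # spark.files.maxPartitionBytes default
                   open_cost_in_bytes=4 * 1024 * 1024):        # spark.files.openCostInBytes default
    # Each file is charged an "open cost" on top of its length.
    total_bytes = sum(size + open_cost_in_bytes for size in file_sizes)
    bytes_per_core = total_bytes // default_parallelism
    # Split size is clamped between the open cost and the max-partition-bytes cap.
    return min(default_max_split_bytes, max(open_cost_in_bytes, bytes_per_core))

# Hypothetical example: 100 image files of 1 MB each, 2 cores.
# total = 100 * (1 MB + 4 MB) = 500 MB, per core = 250 MB, capped at 128 MB.
print(max_split_size([1 * 1024 * 1024] * 100, default_parallelism=2))
```

The number of splits then falls out of totalBytes divided by this split size,
which is why the minPartitions argument appears to be ignored.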

Does this mean that minPartitions is no longer used in the partition
determination calculation?

Kindly advise.

Thanks,
Jayadeep



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-Binary-File-Partition-tp28531.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org