Posted to issues@spark.apache.org by "Reynold Xin (JIRA)" <ji...@apache.org> on 2016/11/02 06:29:58 UTC
[jira] [Updated] (SPARK-16575) partition calculation mismatch with sc.binaryFiles
[ https://issues.apache.org/jira/browse/SPARK-16575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Reynold Xin updated SPARK-16575:
--------------------------------
Target Version/s: 2.1.0
> partition calculation mismatch with sc.binaryFiles
> --------------------------------------------------
>
> Key: SPARK-16575
> URL: https://issues.apache.org/jira/browse/SPARK-16575
> Project: Spark
> Issue Type: Bug
> Components: Input/Output, Java API, Shuffle, Spark Core, Spark Shell
> Affects Versions: 1.6.1, 1.6.2
> Reporter: Suhas
> Priority: Critical
>
> sc.binaryFiles always creates an RDD with 2 partitions.
> Steps to reproduce (tested on Databricks Community Edition):
> 1. Create an RDD using sc.binaryFiles. In this example, the airlines folder contains 1922 files.
> Ex: {noformat}val binaryRDD = sc.binaryFiles("/databricks-datasets/airlines/*"){noformat}
> 2. Check the number of partitions of the above RDD:
> - binaryRDD.partitions.size = 2 (the expected value is more than 2).
> 3. If the RDD is created with sc.textFile instead, the number of partitions is 1921.
> 4. In Spark 1.5.1, the same sc.binaryFiles call creates 1921 partitions.
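> The steps above can be condensed into a spark-shell session (a sketch: the path is the one from this report, and the exact counts depend on the dataset; sc is the SparkContext provided by the shell):
> {noformat}
> // Spark 1.6.x spark-shell
> val binaryRDD = sc.binaryFiles("/databricks-datasets/airlines/*")
> println(binaryRDD.partitions.size)  // 2 on 1.6.1/1.6.2; 1921 on 1.5.1
>
> val textRDD = sc.textFile("/databricks-datasets/airlines/*")
> println(textRDD.partitions.size)    // 1921, roughly one per file
> {noformat}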
> For an explanation with screenshots, see the link below:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Partition-calculation-issue-with-sc-binaryFiles-on-Spark-1-6-2-tt18314.html
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org