Posted to issues@spark.apache.org by "Reynold Xin (JIRA)" <ji...@apache.org> on 2016/11/02 06:29:58 UTC

[jira] [Updated] (SPARK-16575) partition calculation mismatch with sc.binaryFiles

     [ https://issues.apache.org/jira/browse/SPARK-16575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-16575:
--------------------------------
    Target Version/s: 2.1.0

> partition calculation mismatch with sc.binaryFiles
> --------------------------------------------------
>
>                 Key: SPARK-16575
>                 URL: https://issues.apache.org/jira/browse/SPARK-16575
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output, Java API, Shuffle, Spark Core, Spark Shell
>    Affects Versions: 1.6.1, 1.6.2
>            Reporter: Suhas
>            Priority: Critical
>
> sc.binaryFiles always creates an RDD with only 2 partitions.
> Steps to reproduce (tested on Databricks Community Edition; a reproduction sketch follows the link below):
> 1. Create an RDD using sc.binaryFiles. In this example, the airlines folder has 1922 files.
>      Ex: {noformat}val binaryRDD = sc.binaryFiles("/databricks-datasets/airlines/*"){noformat}
> 2. Check the number of partitions of the above RDD:
>     - binaryRDD.partitions.size = 2 (the expected value is more than 2).
> 3. If the RDD is created using sc.textFile instead, the number of partitions is 1921.
> 4. The same sc.binaryFiles call creates 1921 partitions on Spark 1.5.1.
> For an explanation with screenshots, see the link below:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Partition-calculation-issue-with-sc-binaryFiles-on-Spark-1-6-2-tt18314.html
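>
> A minimal reproduction sketch in Scala (a sketch only, assuming a Spark 1.6.x shell with sc in scope and the Databricks airlines dataset at the path above):
> {noformat}
> // Reproduction sketch: partition counts as observed in this report.
> val binaryRDD = sc.binaryFiles("/databricks-datasets/airlines/*")
> println(binaryRDD.partitions.size)  // observed: 2 (expected: more than 2)
>
> // For comparison, sc.textFile on the same path yields 1921 partitions,
> // as does sc.binaryFiles when run on Spark 1.5.1.
> val textRDD = sc.textFile("/databricks-datasets/airlines/*")
> println(textRDD.partitions.size)    // observed: 1921
> {noformat}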


