Posted to user@spark.apache.org by Vikash Pareek <vi...@infoobjects.com> on 2017/06/01 10:28:56 UTC

Number Of Partitions in RDD

Hi,

I am creating an RDD from a text file and specifying the number of partitions,
but Spark gives me a different number of partitions than the one I specified.

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 0)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[72] at textFile
at <console>:27

scala> people.getNumPartitions
res47: Int = 1

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 1)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[50] at textFile
at <console>:27

scala> people.getNumPartitions
res36: Int = 1

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 2)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[52] at textFile
at <console>:27

scala> people.getNumPartitions
res37: Int = 2

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 3)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[54] at textFile
at <console>:27

scala> people.getNumPartitions
res38: Int = 3

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 4)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[56] at textFile
at <console>:27

scala> people.getNumPartitions
res39: Int = 4

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 5)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[58] at textFile
at <console>:27

scala> people.getNumPartitions
res40: Int = 6

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 6)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[60] at textFile
at <console>:27

scala> people.getNumPartitions
res41: Int = 7

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 7)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[62] at textFile
at <console>:27

scala> people.getNumPartitions
res42: Int = 8

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 8)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[64] at textFile
at <console>:27

scala> people.getNumPartitions
res43: Int = 9

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 9)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[66] at textFile
at <console>:27

scala> people.getNumPartitions
res44: Int = 11

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 10)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[68] at textFile
at <console>:27

scala> people.getNumPartitions
res45: Int = 11

scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 11)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[70] at textFile
at <console>:27

scala> people.getNumPartitions
res46: Int = 13

The contents of the file /home/pvikash/data/test.txt are:
"
This is a test file.
Will be used for rdd partition
"

I am trying to understand why the number of partitions changes here, and why
Spark creates empty partitions when the data is small enough to fit into a
single partition.
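
To make any empty partitions visible, a quick check along these lines in
spark-shell (reusing the people RDD above; this is only a sketch, not part of
the original session) prints how many lines each partition actually holds:

// Count the lines sitting in each partition; a count of 0 marks an empty partition.
val counts = people.mapPartitionsWithIndex { (idx, iter) =>
  Iterator((idx, iter.size))
}.collect()
counts.foreach { case (idx, n) => println(s"partition $idx holds $n line(s)") }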

Any explanation would be appreciated.

--Vikash





Re: Number Of Partitions in RDD

Posted by Michael Mior <mm...@uwaterloo.ca>.
While I'm not sure why you're seeing an increase in partitions with such a
small data file, it's worth noting that the second parameter to textFile is
the *minimum* number of partitions, so there's no guarantee you'll get
exactly that number.
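
A quick way to see this in spark-shell (only a sketch, assuming the same local
file from your question) is to compare the requested minimum against what you
actually get back:

// Hypothetical check: request several minimum partition counts for the same small
// file and print what Spark actually produces. The actual count should never be
// lower than the requested minimum, but it may be higher.
val path = "file:///home/pvikash/data/test.txt"
for (requested <- 1 to 11) {
  val actual = sc.textFile(path, requested).getNumPartitions
  println(s"requested minimum = $requested, actual = $actual")
}

If I had to guess, the extra partitions come from the underlying Hadoop
FileInputFormat: it derives a target split size from the total file size
divided by the requested minimum, and for a file this small the rounding can
leave it with one or two more splits than were asked for.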

--
Michael Mior
mmior@apache.org


Re: Number Of Partitions in RDD

Posted by Vikash Pareek <vi...@infoobjects.com>.
Local mode



-----

__Vikash Pareek


Re: Number Of Partitions in RDD

Posted by neil90 <ne...@icloud.com>.
Cluster mode with HDFS, or local mode?





Re: Number Of Partitions in RDD

Posted by Vikash Pareek <vi...@infoobjects.com>.
Spark 1.6.1





Re: Number Of Partitions in RDD

Posted by neil90 <ne...@icloud.com>.
What version of Spark are you using?


