Posted to user@spark.apache.org by Selvam Raman <se...@gmail.com> on 2016/11/08 21:40:43 UTC

How Spark determines Parquet partition size

Hi,

Can you please tell me how Parquet partitions the data when saving a
DataFrame?

I have a DataFrame that contains 10 values, as shown below:

+---------+
|field_num|
+---------+
|      139|
|      140|
|       40|
|       41|
|      148|
|      149|
|      151|
|      152|
|      153|
|      154|
+---------+


df.write.partitionBy("field_num").parquet("/Users/rs/parti/")

It saves the data into partition directories like (field_num=140, ..., field_num=154).
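
For completeness, here is a minimal, self-contained sketch of what I ran
(local mode assumed; the path and values match the above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-partitioning")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// Ten rows, one value each, matching the DataFrame shown above.
val df = Seq(139, 140, 40, 41, 148, 149, 151, 152, 153, 154).toDF("field_num")

// partitionBy creates one directory per distinct value of field_num,
// e.g. /Users/rs/parti/field_num=40/, /Users/rs/parti/field_num=139/, ...
df.write.partitionBy("field_num").parquet("/Users/rs/parti/")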


When I run the command below, it gives 5:

scala> spark.read.parquet("file:///Users/rs/parti").rdd.partitions.length

res4: Int = 5


So how does Parquet partitioning work in Spark?
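
In case it is relevant, I assume (but am not sure) that the read-side count
comes from how Spark packs the small Parquet files into input partitions;
these Spark 2.x settings seem to influence that packing:

// Spark 2.x settings that affect how files are packed into read partitions.
// Both configs exist in Spark 2.x; whether they explain the 5 here is my guess.
spark.conf.get("spark.sql.files.maxPartitionBytes") // default 134217728 (128 MB)
spark.conf.get("spark.sql.files.openCostInBytes")   // default 4194304 (4 MB)
sc.defaultParallelism                               // number of cores with local[*]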


-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"