Posted to user@spark.apache.org by Rishi Shah <ri...@gmail.com> on 2019/07/25 01:29:10 UTC

[PySpark 2.4] Large number of row groups in Parquet files created using Spark

Hi All,

I have the following code, which produces a single 600 MB Parquet file as
expected. However, that file contains 42 row groups! I would expect it to
create at most 6 row groups. Could someone please shed some light on this? Is
there a config setting I can pass when submitting the application with
spark-submit?

# read the source Parquet data
df = spark.read.parquet(INPUT_PATH)
# collapse to a single partition so the output is one file
df.coalesce(1).write.parquet(OUT_PATH)

I did try --conf spark.parquet.block.size and spark.dfs.blocksize, but
neither made a difference.
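
In case I'm setting these incorrectly, here is a minimal sketch of one way
to pass them, assuming the spark.hadoop. prefix is needed to forward them to
the Hadoop configuration the Parquet writer reads (parquet.block.size is the
parquet-mr row-group size in bytes, default 128 MB; the 128 MB values and the
paths below are just illustrative):

from pyspark.sql import SparkSession

# placeholder paths for illustration
INPUT_PATH = "/path/to/input"
OUT_PATH = "/path/to/output"

spark = (
    SparkSession.builder
    .appName("coalesce-parquet")
    # parquet.block.size: target row-group size in bytes; the spark.hadoop.
    # prefix copies the setting into the underlying Hadoop configuration
    .config("spark.hadoop.parquet.block.size", 128 * 1024 * 1024)
    # dfs.blocksize: HDFS block size; matching it to the row-group size is
    # a guess on my part
    .config("spark.hadoop.dfs.blocksize", 128 * 1024 * 1024)
    .getOrCreate()
)

df = spark.read.parquet(INPUT_PATH)
df.coalesce(1).write.parquet(OUT_PATH)

The equivalent spark-submit flag would be
--conf spark.hadoop.parquet.block.size=134217728 (again, assuming the
spark.hadoop. prefix is required rather than the bare property name).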

-- 
Regards,

Rishi Shah