Posted to user@spark.apache.org by Tim Smith <se...@gmail.com> on 2015/07/28 02:42:00 UTC

Controlling output fileSize in SparkSQL

Hi,

I am using Spark 1.3 (CDH 5.4.4). What's the recipe for setting a minimum
output file size when writing out from SparkSQL? So far, I have tried:
------xxxxx---------
import sqlContext.implicits._
// Bypass the cached HDFS FileSystem object so the block-size settings below are picked up
sc.hadoopConfiguration.setBoolean("fs.hdfs.impl.disable.cache", true)
// Ask for 1 GB (1073741824-byte) blocks on both the local filesystem and HDFS
sc.hadoopConfiguration.setLong("fs.local.block.size", 1073741824)
sc.hadoopConfiguration.setLong("dfs.blocksize", 1073741824)
// Cut the number of shuffle partitions, hoping for fewer and larger output files
sqlContext.sql("SET spark.sql.shuffle.partitions=2")
val df = sqlContext.jsonFile("hdfs://nameservice1/user/joe/samplejson/*")
df.saveAsParquetFile("hdfs://nameservice1/user/joe/data/reduceFiles-Parquet")
------xxxxx---------

But my output still isn't aggregated into 1+GB files.
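
The only other approach I can think of is to repartition the DataFrame explicitly before writing, roughly like the sketch below (untested on my end; I'm assuming DataFrame.repartition is available in 1.3, and the partition count of 2 is just a guess based on the total data size):
------xxxxx---------
// Rough sketch: shrink the DataFrame to a couple of partitions so the write
// produces a correspondingly small number of (larger) Parquet files.
val df = sqlContext.jsonFile("hdfs://nameservice1/user/joe/samplejson/*")
df.repartition(2).saveAsParquetFile("hdfs://nameservice1/user/joe/data/reduceFiles-Parquet")
------xxxxx---------

Is explicit repartitioning the recommended way to control output file size here, or is there a setting I'm missing?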

Thanks,

- Siddhartha