Posted to user@spark.apache.org by Gokula Krishnan D <em...@gmail.com> on 2017/07/20 12:46:36 UTC

Spark sc.textFile(): files with more partitions vs. files with fewer partitions

Hello All,

Our Spark applications are designed to process HDFS files (Hive external
tables).

We recently changed the Hive output file sizes by setting the following
parameters, so that the files have an average size of 512 MB:
set hive.merge.mapfiles=true
set hive.merge.mapredfiles=true
set hive.merge.smallfiles.avgsize=536870912 (512MB)

Now I do see a difference in sc.textFile(<HDFS file>).count(): the time has
increased drastically, apparently because the read now produces fewer
partitions.
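
For reference, the read is essentially the following sketch; the path and the
partition hint are placeholders, not our actual values:

    import org.apache.spark.SparkContext

    def countWithHint(sc: SparkContext): Unit = {
      val path = "hdfs:///user/hive/warehouse/my_table"   // placeholder path

      // By default the number of partitions follows the HDFS input splits,
      // so fewer, larger files mean fewer partitions and less parallelism.
      val rdd = sc.textFile(path)
      println(s"default partitions = ${rdd.getNumPartitions}")

      // A minPartitions hint lets each 512 MB file be split into more tasks
      // without touching the Hive merge settings. 64 is an arbitrary example.
      val hinted = sc.textFile(path, minPartitions = 64)
      println(s"hinted partitions  = ${hinted.getNumPartitions}")
      println(hinted.count())
    }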

Is it always better to read a file in Spark with a larger number of
partitions? Based on this, I am planning to revert the Hive settings.
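
For comparison, the other option would be to keep the 512 MB files and add
partitions after the read with repartition(), at the cost of a shuffle; the
path and target count below are again placeholders:

    val rdd = sc.textFile("hdfs:///user/hive/warehouse/my_table") // placeholder path
    // repartition() shuffles the data into the requested number of partitions.
    val wider = rdd.repartition(200)                              // placeholder target
    println(wider.count())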


Thanks & Regards,
Gokula Krishnan (Gokul)