Posted to user@spark.apache.org by Akshay Mendole <ak...@gmail.com> on 2019/03/07 14:47:45 UTC

Re: mapreduce.input.fileinputformat.split.maxsize not working for spark 2.4.0

Hi,
     No. It's a Java application that uses the RDD APIs.
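
Roughly, the read path looks like the sketch below (the path and app name are
hypothetical; the split settings are the ones from my earlier mail):

JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("split-test"));
// textFile goes through Hadoop's old mapred TextInputFormat; it also accepts
// a minPartitions hint, e.g. sc.textFile(path, 10)
JavaRDD<String> lines = sc.textFile("hdfs:///data/large-file.lzo");
System.out.println("partitions: " + lines.getNumPartitions()); // prints 1 today
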
Thanks,
Akshay


On Mon, Feb 25, 2019 at 7:54 AM Manu Zhang <ow...@gmail.com> wrote:

> Is your application using the Spark SQL / DataFrame API? If so, please try
> setting
>
> spark.sql.files.maxPartitionBytes
>
> to a smaller value; it defaults to 128MB.
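>
> A minimal sketch of setting it (assuming a SparkSession-based job; the app
> name and the 50MB figure are placeholders, 50MB being the split size you
> were targeting):
>
> SparkSession spark = SparkSession.builder()
>         .appName("split-size-test")
>         // applies to DataFrame/SQL file scans of splittable input; default is 128MB
>         .config("spark.sql.files.maxPartitionBytes", "52428800") // 50MB in bytes
>         .getOrCreate();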
>
> Thanks,
> Manu Zhang
> On Feb 25, 2019, 2:58 AM +0800, Akshay Mendole <ak...@gmail.com>,
> wrote:
>
> Hi,
>    We have dfs.blocksize configured to be 512MB, and we have some large
> files in HDFS that we want to process with a Spark application. We want to
> split the files into more input splits to optimise for memory, but the
> parameters mentioned above are not working.
> The max and min split size params below are both set to 50MB, yet a file as
> big as 500MB is read as one split when it should produce at least 10 input
> splits (500MB / 50MB).
>
> SparkConf conf = new SparkConf().setAppName(jobName);
> SparkContext sparkContext = new SparkContext(conf);
>
> // cap and floor the input split size at 50MB (50000000 bytes)
> sparkContext.hadoopConfiguration().set("mapreduce.input.fileinputformat.split.maxsize", "50000000");
> sparkContext.hadoopConfiguration().set("mapreduce.input.fileinputformat.split.minsize", "50000000");
>
> JavaSparkContext sc = new JavaSparkContext(sparkContext);
> // register the LZO codec for the compressed input files
> sc.hadoopConfiguration().set("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec");
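>
> For reference, the same read expressed through the new-API input format (a
> sketch with a hypothetical path; this is the FileInputFormat that reads the
> split.minsize/maxsize keys above):
>
> JavaPairRDD<org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text> records =
>         sc.newAPIHadoopFile(
>                 "hdfs:///data/large-file.lzo",
>                 org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class,
>                 org.apache.hadoop.io.LongWritable.class,
>                 org.apache.hadoop.io.Text.class,
>                 sc.hadoopConfiguration());
> // note: TextInputFormat treats compressed input as non-splittable unless the
> // codec is splittable, so an un-indexed .lzo file may still come back as a
> // single split regardless of these settings
> System.out.println("splits: " + records.getNumPartitions());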
>
>
> Could you please suggest what could be wrong with my configuration?
>
> Thanks,
> Akshay
>
>