Posted to user@spark.apache.org by Umar Javed <um...@gmail.com> on 2013/11/17 22:33:11 UTC

number of splits for standalone cluster mode

Hi,

When running Spark in standalone cluster mode, is there a way to
configure the number of splits for the input file(s)? It seems to be
approximately 32 MB per core by default. Is that correct? For example,
in my cluster there are two workers, each running on a machine with two
cores. For an input file of 500 MB, Spark schedules 16 tasks for the
initial map (500/32 ~ 16).

thanks!
Umar

Re: number of splits for standalone cluster mode

Posted by Aaron Davidson <il...@gmail.com>.
The number of splits can be configured when reading the file, as an
argument to textFile(), sequenceFile(), etc. (see the docs:
http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.SparkContext@textFile(String,Int):RDD[String]).
Note that this is a minimum, however, as certain input sources may not
allow partitions larger than a certain size (e.g., reading from HDFS may
force partitions to be at most ~130 MB [depending on HDFS block size]).
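
For instance, a minimal sketch along those lines (the master URL, input
path, and split count below are just placeholder values, not ones from
this thread):

    import org.apache.spark.SparkContext

    // Placeholder master URL and application name.
    val sc = new SparkContext("spark://master:7077", "SplitsExample")

    // Ask for at least 64 splits; the actual count may be higher if the
    // input source caps partition size (e.g. at the HDFS block size).
    val lines = sc.textFile("hdfs:///data/input.txt", 64)
    println(lines.partitions.length)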

If you wish to have fewer partitions than the minimum your input source
allows, you can use the RDD.coalesce() method
(http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.RDD)
to locally combine partitions.
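
Continuing the sketch above, something like the following would merge
the partitions of lines down to 4 (the target count is again just an
example):

    // coalesce() merges existing partitions locally (no shuffle by default).
    val fewer = lines.coalesce(4)
    println(fewer.partitions.length)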

