Posted to user@spark.apache.org by Jaonary Rabarisoa <ja...@gmail.com> on 2014/12/05 15:51:40 UTC

Why is my default partition size set to 52?

Hi all,

I'm trying to run a simple Spark job with spark-shell. All I want to do is
count the number of lines in a file.
I start spark-shell with the default arguments, i.e. just with
./bin/spark-shell.

I load the text file with sc.textFile("path") and then call count on my data.

When I do this, my data is always split into 52 partitions. I don't
understand why, since I'm running on a local machine with 8 cores and
sc.defaultParallelism gives me 8.

Even if I load the file with sc.textFile("path", 8), I still get
data.partitions.size = 52.

I'm using Spark 1.1.1.
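
Concretely, this is roughly my spark-shell session (the path is just a
placeholder for my actual file):

    val data = sc.textFile("/path/to/my/file.txt")   // placeholder path
    data.count()                                     // just counting lines
    data.partitions.size                             // returns 52
    sc.defaultParallelism                            // returns 8

    // Passing a minimum number of partitions does not change it either:
    val data8 = sc.textFile("/path/to/my/file.txt", 8)
    data8.partitions.size                            // still 52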


Any ideas?



Cheers,

Jao

Re: Why is my default partition size set to 52?

Posted by Jaonary Rabarisoa <ja...@gmail.com>.
OK, I misunderstood what a partition means. In fact, my file is 1.7 GB,
and with a smaller file I get a different number of partitions. Thanks for
the clarification.
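
For the record, a rough back-of-the-envelope check, assuming the local
Hadoop InputFormat uses a split size of around 32 MB (an assumption on my
part, not something read from the configuration), lands in the right
ballpark:

    // Hypothetical sanity check; the split size is an assumed default,
    // not read from the actual Hadoop configuration.
    val fileSizeBytes  = 1.7e9.toLong              // roughly my 1.7 GB file
    val splitSizeBytes = 32L * 1024 * 1024         // assumed ~32 MB per split
    val expectedSplits = math.ceil(fileSizeBytes.toDouble / splitSizeBytes).toInt
    // expectedSplits comes out around 51, the same ballpark as the 52 I saw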

On Fri, Dec 5, 2014 at 4:15 PM, Sean Owen <so...@cloudera.com> wrote:

> How big is your file? It's probably of a size for which the Hadoop
> InputFormat would make 52 splits. Data size drives the number of
> partitions, not processing resources. Really, 8 splits is the minimum
> parallelism you want; several times your number of cores is better.
>
> On Fri, Dec 5, 2014 at 8:51 AM, Jaonary Rabarisoa <ja...@gmail.com>
> wrote:
> > Hi all,
> >
> > I'm trying to run a simple Spark job with spark-shell. All I want to do
> > is count the number of lines in a file.
> > I start spark-shell with the default arguments, i.e. just with
> > ./bin/spark-shell.
> >
> > I load the text file with sc.textFile("path") and then call count on my
> > data.
> >
> > When I do this, my data is always split into 52 partitions. I don't
> > understand why, since I'm running on a local machine with 8 cores and
> > sc.defaultParallelism gives me 8.
> >
> > Even if I load the file with sc.textFile("path", 8), I still get
> > data.partitions.size = 52.
> >
> > I'm using Spark 1.1.1.
> >
> >
> > Any ideas?
> >
> >
> >
> > Cheers,
> >
> > Jao
> >
>

Re: Why is my default partition size set to 52?

Posted by Sean Owen <so...@cloudera.com>.
How big is your file? It's probably of a size for which the Hadoop
InputFormat would make 52 splits. Data size drives the number of
partitions, not processing resources. Really, 8 splits is the minimum
parallelism you want; several times your number of cores is better.
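
As a rough illustration (the path is a placeholder and the numbers are only
indicative), the second argument to textFile is just a minimum, so the
InputFormat can still produce more splits; if you really want fewer
partitions you can coalesce afterwards:

    val data = sc.textFile("/path/to/file.txt", 8)  // 8 is only a lower bound
    data.partitions.size                            // can still be 52 for a big file

    // coalesce reduces the partition count without a full shuffle
    val smaller = data.coalesce(8)
    smaller.partitions.size                         // 8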

On Fri, Dec 5, 2014 at 8:51 AM, Jaonary Rabarisoa <ja...@gmail.com> wrote:
> Hi all,
>
> I'm trying to run a simple Spark job with spark-shell. All I want to do is
> count the number of lines in a file.
> I start spark-shell with the default arguments, i.e. just with
> ./bin/spark-shell.
>
> I load the text file with sc.textFile("path") and then call count on my data.
>
> When I do this, my data is always split into 52 partitions. I don't
> understand why, since I'm running on a local machine with 8 cores and
> sc.defaultParallelism gives me 8.
>
> Even if I load the file with sc.textFile("path", 8), I still get
> data.partitions.size = 52.
>
> I'm using Spark 1.1.1.
>
>
> Any ideas?
>
>
>
> Cheers,
>
> Jao
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org