Posted to user@spark.apache.org by Nicholas Chammas <ni...@gmail.com> on 2014/03/24 05:24:04 UTC

How many partitions is my RDD split into?

Hey there fellow Dukes of Data,

How can I tell how many partitions my RDD is split into?

I'm interested in knowing because, from what I gather, having a sensible
number of partitions is important for performance. If I'm looking to
understand how my pipeline is performing, say for a parallelized write out
to HDFS, knowing how many partitions an RDD has would be a good thing to
check.

Is that correct?

I could not find an obvious method or property to see how my RDD is
partitioned. Instead, I devised the following thingy:

# Yield each partition's index as a single element.
def f(idx, itr): yield idx

rdd = sc.parallelize([1, 2, 3, 4], 4)
# One element comes back per partition, so the count equals the number of partitions.
rdd.mapPartitionsWithIndex(f).count()

Frankly, I'm not sure what I'm doing here, but this seems to give me the
answer I'm looking for. Derp. :)

So in summary, should I care about how finely my RDDs are partitioned? And
how would I check on that?

Nick




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-many-partitions-is-my-RDD-split-into-tp3072.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: How many partitions is my RDD split into?

Posted by Nicholas Chammas <ni...@gmail.com>.
Sweet! That's simple enough.

Here's a JIRA ticket to track adding this to PySpark:

https://spark-project.atlassian.net/browse/SPARK-1308
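
For reference, once that ticket is resolved, the check should be a
one-liner along these lines (method name as proposed in the ticket, so
treat it as tentative until it's actually merged):

rdd = sc.parallelize([1, 2, 3, 4], 4)
rdd.getNumPartitions()  # proposed PySpark method; should return 4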

Nick


On Mon, Mar 24, 2014 at 4:29 PM, Patrick Wendell <pw...@gmail.com> wrote:

> Ah we should just add this directly in pyspark - it's as simple as the
> code Shivaram just wrote.
>
> - Patrick

Re: How many partitions is my RDD split into?

Posted by Patrick Wendell <pw...@gmail.com>.
Ah, we should just add this directly in PySpark - it's as simple as the
code Shivaram just wrote.
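
A rough sketch of what that might look like on PySpark's RDD class
(hypothetical until the ticket above is resolved; it just wraps Shivaram's
one-liner):

def getNumPartitions(self):
    # Hypothetical method on pyspark.rdd.RDD: delegate to the underlying JavaRDD.
    return self._jrdd.splits().size()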

- Patrick

On Mon, Mar 24, 2014 at 1:25 PM, Shivaram Venkataraman
<sh...@gmail.com> wrote:
> There is no direct way to get this in pyspark, but you can get it from the
> underlying java rdd. For example
>
> a = sc.parallelize([1,2,3,4], 2)
> a._jrdd.splits().size()

Re: How many partitions is my RDD split into?

Posted by Shivaram Venkataraman <sh...@gmail.com>.
There is no direct way to get this in PySpark, but you can get it from the
underlying Java RDD. For example:

a = sc.parallelize([1, 2, 3, 4], 2)
a._jrdd.splits().size()  # asks the underlying JavaRDD; returns 2
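
If you want that behind a small helper until an official method exists,
something like this works (hypothetical helper name; note that _jrdd is a
private attribute, so this may break across versions):

def num_partitions(rdd):
    # Hypothetical helper: count partitions via the underlying JavaRDD.
    return rdd._jrdd.splits().size()

num_partitions(sc.parallelize([1, 2, 3, 4], 2))  # 2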


On Mon, Mar 24, 2014 at 7:46 AM, Nicholas Chammas <
nicholas.chammas@gmail.com> wrote:

> Mark,
>
> This appears to be a Scala-only feature. :(
>
> Patrick,
>
> Are we planning to add this to PySpark?
>
> Nick

Re: How many partitions is my RDD split into?

Posted by Nicholas Chammas <ni...@gmail.com>.
Mark,

This appears to be a Scala-only feature. :(

Patrick,

Are we planning to add this to PySpark?

Nick


On Mon, Mar 24, 2014 at 12:53 AM, Mark Hamstra <ma...@clearstorydata.com> wrote:

> It's much simpler: rdd.partitions.size

Re: How many partitions is my RDD split into?

Posted by Nicholas Chammas <ni...@gmail.com>.
Oh, glad to know it's that simple!

Patrick, in your last comment did you mean filter *in*? As in, I start
with one year of data and filter it so I have one day left? I'm assuming
that in that case the empty partitions would be for all the days that got
filtered out.

Nick

On Monday, March 24, 2014, Patrick Wendell <pw...@gmail.com> wrote:

> As Mark said you can actually access this easily. The main issue I've
> seen from a performance perspective is people having a bunch of really
> small partitions. This will still work but the performance will
> improve if you coalesce the partitions using rdd.coalesce().
>
> This can happen for example if you do a highly selective filter on an
> RDD. For instance, you filter out one day of data from a dataset of a
> year.
>
> - Patrick

Re: How many partitions is my RDD split into?

Posted by Patrick Wendell <pw...@gmail.com>.
As Mark said, you can actually access this easily. The main issue I've
seen from a performance perspective is people having a bunch of really
small partitions. That will still work, but performance will improve if
you coalesce the partitions using rdd.coalesce().

This can happen for example if you do a highly selective filter on an
RDD. For instance, you filter out one day of data from a dataset of a
year.
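
Here's a minimal sketch of that scenario (hypothetical data sizes; assumes
a PySpark build that exposes coalesce()):

# Hypothetical dataset: one partition per day for a year.
rdd = sc.parallelize(range(365 * 1000), 365)

# A highly selective filter leaves most partitions nearly empty.
one_day = rdd.filter(lambda x: x < 1000)
one_day._jrdd.splits().size()    # still 365 partitions

# Merge the small partitions before doing further work.
compacted = one_day.coalesce(4)
compacted._jrdd.splits().size()  # 4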

- Patrick

On Sun, Mar 23, 2014 at 9:53 PM, Mark Hamstra <ma...@clearstorydata.com> wrote:
> It's much simpler: rdd.partitions.size

Re: How many partitions is my RDD split into?

Posted by Mark Hamstra <ma...@clearstorydata.com>.
It's much simpler: rdd.partitions.size

