Posted to dev@spark.apache.org by Madhu <ma...@madhu.com> on 2014/12/16 18:09:54 UTC

RDD data flow

I was looking at some of the Partition implementations in core/rdd and
getOrCompute(...) in CacheManager.
It appears that getOrCompute(...) returns an InterruptibleIterator, which
delegates to a wrapped Iterator.
That would imply that Partitions should extend Iterator, but that is not
always the case.
For example, the Partition classes for these RDDs do not extend Iterator:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PartitionwiseSampledRDD.scala
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala

Why is that? Shouldn't all Partitions be Iterators? Clearly I'm missing
something.

On a related subject, I was thinking of documenting the data flow of RDDs in
more detail. The code is not hard to follow, but it's nice to have a simple
picture with the major components and some explanation of the flow. The
declaration of Partition is throwing me off.
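
For readers following the thread, here is a rough toy model of the flow
being asked about: a cache lookup that either reuses stored elements or
calls a compute function, then hands back an iterator wrapped so that
iteration can be interrupted. This is only a sketch in Python, not Spark's
code. The names ToyCacheManager, InterruptibleIter, and TaskKilled are
hypothetical, the partition is reduced to a plain integer index, and the
cache materializes elements eagerly, which simplifies what a real cache
manager would do.

```python
from typing import Callable, Dict, Iterator, List, Tuple


class TaskKilled(Exception):
    """Raised when iteration is attempted after the task was asked to stop."""


class InterruptibleIter:
    """Delegates to a wrapped iterator, checking a kill flag before each element."""

    def __init__(self, is_killed: Callable[[], bool], delegate: Iterator[int]) -> None:
        self._is_killed = is_killed
        self._delegate = delegate

    def __iter__(self) -> "InterruptibleIter":
        return self

    def __next__(self) -> int:
        if self._is_killed():
            raise TaskKilled()
        # The wrapper holds no data of its own; it forwards to the delegate.
        return next(self._delegate)


class ToyCacheManager:
    """Hypothetical cache keyed by (rdd_id, split_index)."""

    def __init__(self) -> None:
        self._blocks: Dict[Tuple[int, int], List[int]] = {}
        self.compute_calls = 0

    def get_or_compute(
        self,
        rdd_id: int,
        split_index: int,
        compute: Callable[[int], Iterator[int]],
        is_killed: Callable[[], bool],
    ) -> InterruptibleIter:
        key = (rdd_id, split_index)
        if key not in self._blocks:
            # Cache miss: ask the compute function for this slice's elements.
            self.compute_calls += 1
            self._blocks[key] = list(compute(split_index))
        # Hit or miss, the caller gets an iterator over elements.
        return InterruptibleIter(is_killed, iter(self._blocks[key]))


def compute(split_index: int) -> Iterator[int]:
    # Assumption for the example: each slice holds three consecutive integers.
    return iter(range(split_index * 3, split_index * 3 + 3))


cache = ToyCacheManager()
first = list(cache.get_or_compute(0, 1, compute, lambda: False))
second = list(cache.get_or_compute(0, 1, compute, lambda: False))
print(first, second, cache.compute_calls)  # [3, 4, 5] [3, 4, 5] 1
```

In this model the wrapped iterator ranges over the elements of a slice; the
slice is named only by its index.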

Thanks!



-----
--
Madhu
https://www.linkedin.com/in/msiddalingaiah
--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-data-flow-tp9804.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org

Re: RDD data flow

Posted by Madhu <ma...@madhu.com>.

Patrick Wendell wrote
> The Partition itself doesn't need to be an iterator - the iterator
> comes from the result of compute(partition). The Partition is just an
> identifier for that partition, not the data itself.

OK, that makes sense. The docs for Partition are a bit vague on this point.
Maybe I'll add this to the docs.

Thanks Patrick!




Re: RDD data flow

Posted by Patrick Wendell <pw...@gmail.com>.

> Why is that? Shouldn't all Partitions be Iterators? Clearly I'm missing
> something.

The Partition itself doesn't need to be an iterator - the iterator
comes from the result of compute(partition). The Partition is just an
identifier for that partition, not the data itself. Take a look at the
signature for compute() in the RDD class.

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L97
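
To make the distinction concrete, here is a minimal sketch in Python rather
than Scala. It is a toy model, not Spark's implementation: the RangeRDD
class and its slicing arithmetic are made up for illustration, and only the
shape (a Partition that carries an index, and a compute method that takes a
Partition and returns an iterator) mirrors the point above.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class Partition:
    """Identifies one slice of a dataset; it holds no elements itself."""
    index: int


class RangeRDD:
    """Toy stand-in for an RDD over the integers [0, n), split into slices."""

    def __init__(self, n: int, num_slices: int) -> None:
        self.n = n
        self.num_slices = num_slices

    def partitions(self) -> List[Partition]:
        return [Partition(i) for i in range(self.num_slices)]

    def compute(self, split: Partition) -> Iterator[int]:
        # The data comes from here; the partition only says which slice.
        start = split.index * self.n // self.num_slices
        end = (split.index + 1) * self.n // self.num_slices
        return iter(range(start, end))


rdd = RangeRDD(n=10, num_slices=3)
slices = [list(rdd.compute(p)) for p in rdd.partitions()]
print(slices)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
```

Note that Partition has no iteration methods at all; everything iterable is
produced by compute.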

