Posted to user@spark.apache.org by Yang <te...@gmail.com> on 2015/07/20 04:44:30 UTC

how to start reading the spark source code?

I'm trying to understand how spark works under the hood, so I tried to read
the source code.

as I normally do, I downloaded the git source and reverted to the very first
version (actually e5c4cd8a5e188592f8786a265c0cd073c69ac886, since the very
first commit lacked even the definition of RDD.scala)

but the code looks "too simple" and I can't find where the "magic" happens,
i.e. where a transformation/computation is scheduled on a machine, where
bytes are stored, etc.

it would be great if someone could show me a path through the different
source files involved, so that I could read each of them in turn.

thanks!
yang

Re: how to start reading the spark source code?

Posted by Yang <te...@gmail.com>.
also, one peculiar difference vs Hadoop MR is that a partition/split of an
RDD is as much an operation as it is data: an RDD is associated with a
transformation, and with a lineage of all its ancestor RDDs. so when a
partition is transferred to a new executor/worker (potentially on another
box), the operation definition/code is transferred together with the data to
that new executor, involving serialization and deserialization. this is
something new: for hadoop MR to do this, it would need to ship a jar file to
the worker, but spark being scala, it is easy to transfer the entire
operation (Task[] ) through serialization.
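
to make that concrete, here is a minimal sketch (plain JVM serialization, not
spark's actual closure-shipping code path) of what "transferring the
operation" amounts to: a scala function is just an object, so it can be
written to bytes and rebuilt in another JVM:

import java.io._

object ClosureShipping {
  def main(args: Array[String]): Unit = {
    // the "operation", e.g. the function passed to rdd.map
    val op: Int => Int = x => x * 2

    // driver side: serialize the closure to bytes
    val buf = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(buf)
    oos.writeObject(op)
    oos.close()

    // "worker" side: deserialize and apply to the partition's data
    val ois = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
    val shipped = ois.readObject().asInstanceOf[Int => Int]
    println(List(1, 2, 3).map(shipped)) // List(2, 4, 6)
  }
}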

Re: how to start reading the spark source code?

Posted by Yang <te...@gmail.com>.
ok.... got a head start:

pull the git source to 14719b93ff4ea7c3234a9389621be3c97fa278b9 (the first
release, so that I could at least build it)

then build it according to README.md,
then get eclipse set up, with scala-ide,
then create a new scala project and set the project directory to
SPARK_SOURCE_HOME/core instead of the default

in eclipse, remove the tests from the source path,

copy all the jars from SPARK_SOURCE_HOME/lib_managed into a separate dir,
then in eclipse add all of these as external jars.

set your scala project runtime to 2.10.5 (the one coming with spark seems to
be 2.10.4; the eclipse default is 2.9 something)
there will be 2 compile errors: one due to Tuple(), change it to Tuple2; the
other is "currentThread", change it to Thread.currentThread() (see the
sketch below)
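
for reference, the fixed forms look roughly like this (a sketch; the exact
call sites vary, and the old code presumably relied on the since-removed
unqualified currentThread from Predef):

val pair = Tuple2("key", 1)        // instead of the bare Tuple("key", 1)
val t = Thread.currentThread()     // instead of unqualified currentThread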

then it will build fine

I pasted the hello-world from the docs; since the "getting started" doc is
for the latest version, I had to make some minor changes:

package spark

import spark.SparkContext
import spark.SparkContext._

object Tryout {
  def main(args: Array[String]) {
    val logFile = "../README.md" // should be some file on your system
    // constructor in this old API: master, app name, spark home, jars to ship
    val sc = new SparkContext("local", "tryout", ".",
      List(System.getenv("SPARK_EXAMPLES_JAR")))
    val logData = sc.textFile(logFile, 2).cache()

    // plain-scala alternative, handy for comparison while debugging:
    // val logData = scala.io.Source.fromFile(args(0)).getLines().toArray

    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}

then I debugged through this, and it became fairly clear

Re: how to start reading the spark source code?

Posted by Yang <te...@gmail.com>.
thanks, my point is that earlier versions are normally much simpler, so
they're easier to follow. and the basic structure should at least bear a
great similarity to the latest version

Re: how to start reading the spark source code?

Posted by Ted Yu <yu...@gmail.com>.
e5c4cd8a5e188592f8786a265 was from 2011.

Not sure why you started with such an early commit.

The Spark project has evolved quite fast.

I suggest you clone the Spark project from github.com/apache/spark/ and start
with core/src/main/scala/org/apache/spark/rdd/RDD.scala
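
To give a feel for what you will find there, here is a toy model (my own
sketch, not Spark's actual classes) of the two ideas RDD.scala is built
around: a transformation only records a lineage (a parent RDD plus a
function), and an action is what finally walks that lineage and computes:

// toy model of RDD lineage and lazy evaluation; not Spark code
abstract class ToyRDD[T] {
  // how this dataset produces its records once an action runs
  def compute(): Iterator[T]

  // transformations: no work happens here, they only extend the lineage
  def map[U](f: T => U): ToyRDD[U] = new MappedToyRDD(this, f)
  def filter(p: T => Boolean): ToyRDD[T] = new FilteredToyRDD(this, p)

  // an action: walks the lineage and actually computes
  def count(): Long = compute().size
}

class SeqToyRDD[T](data: Seq[T]) extends ToyRDD[T] {
  def compute(): Iterator[T] = data.iterator
}

// each derived RDD remembers only its parent and its function
class MappedToyRDD[T, U](parent: ToyRDD[T], f: T => U) extends ToyRDD[U] {
  def compute(): Iterator[U] = parent.compute().map(f)
}

class FilteredToyRDD[T](parent: ToyRDD[T], p: T => Boolean) extends ToyRDD[T] {
  def compute(): Iterator[T] = parent.compute().filter(p)
}

// usage: nothing runs until count() is called
// new SeqToyRDD(Seq("a", "ab", "b")).filter(_.contains("a")).count() == 2

The real RDD.scala layers partitions, caching, and a scheduler on top of
essentially this shape.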

Cheers
