Posted to user@spark.apache.org by "C. Josephson" <cj...@uhana.io> on 2016/07/21 00:56:18 UTC

Understanding Spark UI DAGs

I just started looking at the DAG for a Spark Streaming job, and had a
couple of questions about it (image inline).

1.) What do the numbers in brackets mean, e.g. PythonRDD[805]?
2.) What code is "RDD at PythonRDD.scala:43" referring to? Is there any way
to tie this back to lines of code we've written in pyspark?

[image: Inline image 1]
Thanks,
-cjosephson

Re: Understanding Spark UI DAGs

Posted by "C. Josephson" <cj...@uhana.io>.
OK, so those line numbers in our DAG don't refer to our code. Is there any
way to display (or calculate) line numbers that refer to code we actually
wrote, or is that only possible in Scala Spark?
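
(A possible workaround, sketched below: PySpark's RDD.setName() attaches a
human-readable label to an RDD, and that label shows up alongside the RDD in
the UI and in toDebugString(). The label text below is made up for
illustration, and how prominently the DAG visualization itself displays it
depends on the Spark version.)

# Rough sketch: label RDDs yourself so the UI shows names you recognize.
from pyspark import SparkContext

sc = SparkContext("local[2]", "name-your-rdds")
parsed = (sc.parallelize(["a,1", "b,2"])
            .map(lambda line: line.split(","))
            .setName("ctr_parsing: split lines"))  # hypothetical label
print(parsed.name())    # -> ctr_parsing: split lines
parsed.count()          # run a job, then look for the label in the UI
sc.stop()

(In a streaming job the same call would have to happen on the RDDs inside
something like DStream.transform() or foreachRDD(), since DStreams themselves
don't expose setName().)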

On Thu, Jul 21, 2016 at 12:24 PM, Jacek Laskowski <ja...@japila.pl> wrote:

> Hi,
>
> My limited understanding of the Python-Spark bridge is that at some
> point the Python code communicates over the wire with Spark's backbone,
> which includes PythonRDD [1].
>
> When the CallSite can't be computed, it shows up as null:-1 to denote
> that nothing could be referred to.
>
> [1]
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
>
> Pozdrawiam,
> Jacek Laskowski
> ----
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Thu, Jul 21, 2016 at 8:36 PM, C. Josephson <cj...@uhana.io> wrote:
> >> It's called a CallSite, which shows where the line comes from. You can
> >> see the code yourself given the Python file and the line number.
> >
> >
> > But that's what I don't understand. Which Python file? We spark-submit
> > one file called ctr_parsing.py, but it only has 150 lines. So what is
> > MapPartitions at PythonRDD.scala:374 referring to? ctr_parsing.py
> > imports a number of support functions we wrote, but how do we know
> > which Python file to look at?
> >
> > Furthermore, what on earth is null:-1 referring to?
>



-- 
Colleen Josephson
Engineering Researcher
Uhana, Inc.

Re: Understanding Spark UI DAGs

Posted by RK Aduri <rk...@collectivei.com>.
That -1 is coming from here:

PythonRDD.writeIteratorToStream(inputIterator, dataOut)
dataOut.writeInt(SpecialLengths.END_OF_DATA_SECTION)  // val END_OF_DATA_SECTION = -1
dataOut.writeInt(SpecialLengths.END_OF_STREAM)
dataOut.flush()

> On Jul 21, 2016, at 12:24 PM, Jacek Laskowski <ja...@japila.pl> wrote:
> 
> Hi,
> 
> My limited understanding of the Python-Spark bridge is that at some
> point the Python code communicates over the wire with Spark's backbone,
> which includes PythonRDD [1].
> 
> When the CallSite can't be computed, it shows up as null:-1 to denote
> that nothing could be referred to.
> 
> [1] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
> 
> Pozdrawiam,
> Jacek Laskowski
> ----
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
> 
> 
> On Thu, Jul 21, 2016 at 8:36 PM, C. Josephson <cj...@uhana.io> wrote:
>>> It's called a CallSite, which shows where the line comes from. You can
>>> see the code yourself given the Python file and the line number.
>> 
>> 
>> But that's what I don't understand. Which Python file? We spark-submit one
>> file called ctr_parsing.py, but it only has 150 lines. So what is
>> MapPartitions at PythonRDD.scala:374 referring to? ctr_parsing.py imports a
>> number of support functions we wrote, but how do we know which Python file
>> to look at?
>> 
>> Furthermore, what on earth is null:-1 referring to?
> 
> 



Re: Understanding Spark UI DAGs

Posted by Jacek Laskowski <ja...@japila.pl>.
Hi,

My limited understanding of the Python-Spark bridge is that at some
point the Python code communicates over the wire with Spark's backbone,
which includes PythonRDD [1].

When the CallSite can't be computed, it shows up as null:-1 to denote
that nothing could be referred to.

[1] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala

Pozdrawiam,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Thu, Jul 21, 2016 at 8:36 PM, C. Josephson <cj...@uhana.io> wrote:
>> It's called a CallSite, which shows where the line comes from. You can see
>> the code yourself given the Python file and the line number.
>
>
> But that's what I don't understand. Which Python file? We spark-submit one
> file called ctr_parsing.py, but it only has 150 lines. So what is
> MapPartitions at PythonRDD.scala:374 referring to? ctr_parsing.py imports a
> number of support functions we wrote, but how do we know which Python file
> to look at?
>
> Furthermore, what on earth is null:-1 referring to?



Re: Understanding Spark UI DAGs

Posted by "C. Josephson" <cj...@uhana.io>.
>
> It's called a CallSite, which shows where the line comes from. You can see
> the code yourself given the Python file and the line number.
>

But that's what I don't understand. Which Python file? We spark-submit one
file called ctr_parsing.py, but it only has 150 lines. So what is
MapPartitions at PythonRDD.scala:374 referring to? ctr_parsing.py imports a
number of support functions we wrote, but how do we know which Python file
to look at?

Furthermore, what on earth is null:-1 referring to?

Re: Understanding Spark UI DAGs

Posted by Jacek Laskowski <ja...@japila.pl>.
On Thu, Jul 21, 2016 at 2:56 AM, C. Josephson <cj...@uhana.io> wrote:

> I just started looking at the DAG for a Spark Streaming job, and had a
> couple of questions about it (image inline).
>
> 1.) What do the numbers in brackets mean, e.g. PythonRDD[805]?
>

Every RDD has an identifier (its id attribute) that is unique within a
SparkContext (the broadest scope an RDD can belong to). In this case it
means you've already created 806 RDDs (counting from 0).
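
As a minimal illustration (a sketch assuming a throwaway local SparkContext;
the exact numbers you see will differ, since PySpark may create intermediate
RDDs of its own):

# Rough sketch: the bracketed number in "PythonRDD[805]" is the RDD's id,
# assigned in creation order within the SparkContext.
from pyspark import SparkContext

sc = SparkContext("local[2]", "rdd-ids")
a = sc.parallelize(range(10))
b = a.map(lambda x: x * 2)
print(a.id())   # e.g. 0
print(b.id())   # a larger number -- ids only grow as RDDs are created
sc.stop()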


> 2.) What code is "RDD at PythonRDD.scala:43" referring to? Is there any
> way to tie this back to lines of code we've written in pyspark?
>

It's called a CallSite, which shows where the line comes from. You can see
the code yourself given the Python file and the line number.

Jacek
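
A minimal PySpark sketch of the same idea (assuming a local SparkContext; the
exact RDD names, ids, and PythonRDD.scala line numbers in the output depend on
the Spark version):

# Rough sketch: toDebugString() prints the RDD lineage, with the same
# "PythonRDD[n] at RDD at PythonRDD.scala:<line>" labels the DAG uses.
from pyspark import SparkContext

sc = SparkContext("local[2]", "callsites")
rdd = sc.parallelize(range(100)).map(lambda x: x % 3).distinct()
# toDebugString() returns bytes in this era of PySpark, hence the decode()
print(rdd.toDebugString().decode("utf-8"))
sc.stop()

The Scala file and line in those labels point at Spark's own PythonRDD
wrapper, not at the user's pyspark code, which is why they never match lines
in ctr_parsing.py.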