Posted to dev@spark.apache.org by mhawes <ha...@gmail.com> on 2021/05/21 21:36:35 UTC

Re: Bridging gap between Spark UI and Code

Reviving this thread to ask whether any of the Spark maintainers would
consider helping to scope a solution for this. Michal outlines the problem
earlier in this thread, but to clarify: the issue is that for very complex
Spark applications, where the logical plans often span many pages, it is
extremely hard to figure out how the stages and RDD operations in the Spark
UI link to the logical plan that generated them.

Now, obviously this is a hard problem to solve given the various
optimisations and transformations that go on between these two
representations. However, I wanted to raise it as a potential option as I
think it would be /extremely/ valuable for Spark users.

My two main ideas are either:
 - To carry a reference to the original plan around when
planning/optimising (see the sketch below).
 - To maintain a separate mapping for each planning/optimisation step that
maps from source to target. I'm thinking along the lines of JavaScript
source maps.
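
As a rough illustration of the first idea, here is a hypothetical sketch
(none of these names exist in Spark today) that piggybacks on Catalyst's
existing TreeNodeTag mechanism for attaching metadata to plan nodes:

    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.trees.TreeNodeTag

    // Hypothetical sketch only: "origin" is an invented tag, not a Spark API.
    object PlanOrigin {
      val originTag = TreeNodeTag[LogicalPlan]("origin")

      // Tag every node of the analyzed plan with itself, so later
      // planning/optimisation output can answer "which node produced me?".
      def tagAll(plan: LogicalPlan): LogicalPlan = {
        plan.foreach(node => node.setTagValue(originTag, node))
        plan
      }

      def originOf(node: LogicalPlan): Option[LogicalPlan] =
        node.getTagValue(originTag)
    }

The catch is that tags are only copied across some node rewrites, so a real
implementation would need to make every rule propagate them reliably.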

It would be great to get the opinion of an experienced Spark maintainer on
this, given the complexity. 



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: [External Sender] Re: Bridging gap between Spark UI and Code

Posted by Eugene Koifman <eu...@workday.com.INVALID>.
If the question is about SQL operators, Spark 3.1.3 will display the stage IDs on the SQL tab.


[diagram omitted]

From: Wenchen Fan <cl...@gmail.com>
Date: Tuesday, May 25, 2021 at 2:31 AM
To: mhawes <ha...@gmail.com>
Cc: Spark dev list <de...@spark.apache.org>
Subject: [External Sender] Re: Bridging gap between Spark UI and Code

You can see the SQL plan node name in the DAG visualization. Please refer to https://spark.apache.org/docs/latest/web-ui.html for more details. If you still have any confusion, please let us know and we will keep improving the document.

On Tue, May 25, 2021 at 4:41 AM mhawes <ha...@gmail.com> wrote:
@Wenchen Fan, understood that the mapping of query plan to application code
is very hard. I was wondering if we might instead just handle the mapping
from the final physical plan to the stage graph, so that, for example, you
could tell which part of the plan generated which stages. I feel this would
provide the most benefit without having to reason across the many
optimisation steps in between.

The main issue as I see it is that currently, if there's a failing stage,
it's almost impossible to track down the part of the plan that generated the
stage. Would this be possible? If not, do you have any other suggestions for
this kind of debugging?

Best,
Matt



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

Re: Bridging gap between Spark UI and Code

Posted by Wenchen Fan <cl...@gmail.com>.
You can see the SQL plan node name in the DAG visualization. Please refer
to https://spark.apache.org/docs/latest/web-ui.html for more details. If
you still have any confusion, please let us know and we will keep improving
the document.

On Tue, May 25, 2021 at 4:41 AM mhawes <ha...@gmail.com> wrote:

> @Wenchen Fan, understood that the mapping of query plan to application code
> is very hard. I was wondering if we might instead just handle the mapping
> from the final physical plan to the stage graph, so that, for example, you
> could tell which part of the plan generated which stages. I feel this would
> provide the most benefit without having to reason across the many
> optimisation steps in between.
>
> The main issue as I see it is that currently, if there's a failing stage,
> it's almost impossible to track down the part of the plan that generated
> the stage. Would this be possible? If not, do you have any other
> suggestions for this kind of debugging?
>
> Best,
> Matt
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>

Re: Bridging gap between Spark UI and Code

Posted by mhawes <ha...@gmail.com>.
@Wenchen Fan, understood that the mapping of query plan to application code
is very hard. I was wondering if we might instead just handle the mapping
from the final physical plan to the stage graph, so that, for example, you
could tell which part of the plan generated which stages. I feel this would
provide the most benefit without having to reason across the many
optimisation steps in between.
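
To make "final physical plan" concrete, here is a minimal illustration
(assuming Spark 3.0+, where explain() accepts a mode string):

    // Build a trivial aggregation and print its executed physical plan.
    val df = spark.range(1000).selectExpr("id % 10 AS bucket").groupBy("bucket").count()
    df.explain("formatted")
    // Prints the executed plan with numbered operators and codegen ids.
    // Those ids appear in the UI's DAG visualization, but a failing entry
    // in the Stages tab still can't be traced back to these operators.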

The main issue as I see it is that currently, if there’s a failing stage,
it’s almost impossible to track down the part of the plan that generated the
stage. Would this be possible? If not, do you have any other suggestions for
this kind of debugging?

Best,
Matt



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: Bridging gap between Spark UI and Code

Posted by Mich Talebzadeh <mi...@gmail.com>.
Also, some operators can be repeated: if a node dies, Spark needs to rebuild
the lost state from the RDD lineage.

HTH

Mich






On Mon, 24 May 2021 at 18:22, Wenchen Fan <cl...@gmail.com> wrote:

> I believe you can already see each plan change Spark made to your query
> plan in the debug-level logs. I think it's hard to do in the web UI, as
> keeping all these historical query plans would be expensive.
>
> Mapping the query plan to your application code is nearly impossible, as
> so many optimizations can happen (some operators can be removed, some
> operators can be replaced by different ones, some operators can be added by
> Spark).
>
> On Mon, May 24, 2021 at 10:30 PM Will Raschkowski
> <wr...@palantir.com.invalid> wrote:
>
>> This would be great.
>>
>>
>>
>> At least for logical nodes, would it be possible to re-use the existing
>> Utils.getCallSite
>> <https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L1526>
>> to populate a field when nodes are created? I suppose most value would come
>> from eventually passing the call-sites along to physical nodes. But maybe,
>> just as a starting point, Spark could display the call-site only with
>> unoptimized logical plans? Users would still get a better sense for how the
>> plan’s structure relates to their code.
>>
>>
>>
>> From: mhawes <ha...@gmail.com>
>> Date: Friday, 21 May 2021 at 22:36
>> To: dev@spark.apache.org <de...@spark.apache.org>
>> Subject: Re: Bridging gap between Spark UI and Code
>>
>> Reviving this thread to ask whether any of the Spark maintainers would
>> consider helping to scope a solution for this. Michal outlines the problem
>> earlier in this thread, but to clarify: the issue is that for very complex
>> Spark applications, where the logical plans often span many pages, it is
>> extremely hard to figure out how the stages and RDD operations in the
>> Spark UI link to the logical plan that generated them.
>>
>> Now, obviously this is a hard problem to solve given the various
>> optimisations and transformations that go on between these two
>> representations. However, I wanted to raise it as a potential option as I
>> think it would be /extremely/ valuable for Spark users.
>>
>> My two main ideas are either:
>>  - To carry a reference to the original plan around when
>> planning/optimising.
>>  - To maintain a separate mapping for each planning/optimisation step that
>> maps from source to target. I'm thinking along the lines of JavaScript
>> source maps.
>>
>> It would be great to get the opinion of an experienced Spark maintainer on
>> this, given the complexity.
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>

Re: Bridging gap between Spark UI and Code

Posted by Wenchen Fan <cl...@gmail.com>.
I believe you can already see each plan change Spark made to your query plan
in the debug-level logs. I think it's hard to do in the web UI, as keeping
all these historical query plans would be expensive.
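
For example (assuming Spark 3.1+, where the plan change logging level is
configurable via spark.sql.planChangeLog.level), you can surface those
per-rule rewrites without turning on global debug logging:

    import org.apache.spark.sql.SparkSession

    // Assumption: spark.sql.planChangeLog.level is available (Spark 3.1+).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("plan-change-logging")
      .config("spark.sql.planChangeLog.level", "WARN") // log each rule's before/after plan
      .getOrCreate()

    spark.range(100).selectExpr("id * 2 AS doubled").where("doubled > 10").collect()
    // The driver log now shows, for every rule that changed the plan,
    // the plan before and after that rule was applied.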

Mapping the query plan to your application code is nearly impossible, as so
many optimizations can happen (some operators can be removed, some
operators can be replaced by different ones, some operators can be added by
Spark).

On Mon, May 24, 2021 at 10:30 PM Will Raschkowski
<wr...@palantir.com.invalid> wrote:

> This would be great.
>
>
>
> At least for logical nodes, would it be possible to re-use the existing
> Utils.getCallSite
> <https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L1526>
> to populate a field when nodes are created? I suppose most value would come
> from eventually passing the call-sites along to physical nodes. But maybe,
> just as a starting point, Spark could display the call-site only with
> unoptimized logical plans? Users would still get a better sense for how the
> plan’s structure relates to their code.
>
>
>
> From: mhawes <ha...@gmail.com>
> Date: Friday, 21 May 2021 at 22:36
> To: dev@spark.apache.org <de...@spark.apache.org>
> Subject: Re: Bridging gap between Spark UI and Code
>
> Reviving this thread to ask whether any of the Spark maintainers would
> consider helping to scope a solution for this. Michal outlines the problem
> earlier in this thread, but to clarify: the issue is that for very complex
> Spark applications, where the logical plans often span many pages, it is
> extremely hard to figure out how the stages and RDD operations in the
> Spark UI link to the logical plan that generated them.
>
> Now, obviously this is a hard problem to solve given the various
> optimisations and transformations that go on between these two
> representations. However, I wanted to raise it as a potential option as I
> think it would be /extremely/ valuable for Spark users.
>
> My two main ideas are either:
>  - To carry a reference to the original plan around when
> planning/optimising.
>  - To maintain a separate mapping for each planning/optimisation step that
> maps from source to target. I'm thinking along the lines of JavaScript
> source maps.
>
> It would be great to get the opinion of an experienced Spark maintainer on
> this, given the complexity.
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>

Re: Bridging gap between Spark UI and Code

Posted by Will Raschkowski <wr...@palantir.com.INVALID>.
This would be great.

At least for logical nodes, would it be possible to re-use the existing
Utils.getCallSite
<https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L1526>
to populate a field when nodes are created? I suppose most value would come
from eventually passing the call-sites along to physical nodes. But maybe,
just as a starting point, Spark could display the call-site only with
unoptimized logical plans? Users would still get a better sense for how the
plan's structure relates to their code.
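
A hypothetical sketch of what that could look like (nothing below exists in
Spark today, and Utils is private[spark], so this would have to live inside
Spark itself):

    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.trees.TreeNodeTag
    import org.apache.spark.util.Utils

    // Hypothetical sketch: "creationSite" is an invented tag. It reuses the
    // same Utils.getCallSite machinery the RDD API uses for its UI labels.
    object CallSiteCapture {
      val creationSiteTag = TreeNodeTag[String]("creationSite")

      // Would be invoked by Dataset transformations as they build new nodes.
      def tag(plan: LogicalPlan): LogicalPlan = {
        val site = Utils.getCallSite() // shortForm like "filter at MyJob.scala:42"
        plan.setTagValue(creationSiteTag, site.shortForm)
        plan
      }
    }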

From: mhawes <ha...@gmail.com>
Date: Friday, 21 May 2021 at 22:36
To: dev@spark.apache.org <de...@spark.apache.org>
Subject: Re: Bridging gap between Spark UI and Code


Reviving this thread to ask whether any of the Spark maintainers would
consider helping to scope a solution for this. Michal outlines the problem
earlier in this thread, but to clarify: the issue is that for very complex
Spark applications, where the logical plans often span many pages, it is
extremely hard to figure out how the stages and RDD operations in the Spark
UI link to the logical plan that generated them.

Now, obviously this is a hard problem to solve given the various
optimisations and transformations that go on between these two
representations. However, I wanted to raise it as a potential option as I
think it would be /extremely/ valuable for Spark users.

My two main ideas are either:
 - To carry a reference to the original plan around when
planning/optimising.
 - To maintain a separate mapping for each planning/optimisation step that
maps from source to target. I'm thinking along the lines of JavaScript
source maps.

It would be great to get the opinion of an experienced Spark maintainer on
this, given the complexity.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org