Posted to issues@spark.apache.org by "Gengliang Wang (Jira)" <ji...@apache.org> on 2020/06/04 21:22:00 UTC

[jira] [Commented] (SPARK-31903) toPandas with Arrow enabled doesn't show metrics in Query UI.

    [ https://issues.apache.org/jira/browse/SPARK-31903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126192#comment-17126192 ] 

Gengliang Wang commented on SPARK-31903:
----------------------------------------

I investigated the issue, and I think the SparkListenerSQLExecutionEnd event is sent too early when Arrow is enabled.
Here is what the event sequence normally looks like:

{code:java}
SparkListenerSQLExecutionStart
SparkListenerJobStart
SparkListenerTaskStart
...
SparkListenerTaskEnd
SparkListenerJobEnd
SparkListenerSQLExecutionEnd
{code}

However, with Arrow enabled, the events look like:
{code:java}
SparkListenerSQLExecutionStart
SparkListenerSQLExecutionEnd
SparkListenerJobStart
SparkListenerTaskStart
...
SparkListenerTaskEnd
SparkListenerJobEnd
{code}

The metrics are aggregated when the "SparkListenerSQLExecutionEnd" event is received, which is why no metrics appear in the UI.
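To illustrate why the ordering matters, here is a minimal sketch in plain Python. It is a simplified model, not actual Spark listener code: the `MetricsListener` class and its payloads are hypothetical, standing in for the real SQL listener that aggregates task metrics on SQLExecutionEnd.

```python
# Simplified model of a listener that aggregates metrics only when it
# receives SparkListenerSQLExecutionEnd (hypothetical class, for illustration).
class MetricsListener:
    def __init__(self):
        self.task_metrics = []
        self.aggregated = None

    def on_event(self, event, payload=None):
        if event == "SparkListenerTaskEnd":
            self.task_metrics.append(payload)
        elif event == "SparkListenerSQLExecutionEnd":
            # Aggregation happens here; TaskEnd events arriving later are missed.
            self.aggregated = sum(self.task_metrics)

# Normal ordering: tasks finish before the execution-end event.
normal = MetricsListener()
for event, payload in [("SparkListenerSQLExecutionStart", None),
                       ("SparkListenerTaskEnd", 10),
                       ("SparkListenerTaskEnd", 20),
                       ("SparkListenerSQLExecutionEnd", None)]:
    normal.on_event(event, payload)

# Arrow ordering: the execution-end event arrives before the tasks run.
arrow = MetricsListener()
for event, payload in [("SparkListenerSQLExecutionStart", None),
                       ("SparkListenerSQLExecutionEnd", None),
                       ("SparkListenerTaskEnd", 10),
                       ("SparkListenerTaskEnd", 20)]:
    arrow.on_event(event, payload)

print(normal.aggregated)  # 30 -- all task metrics are included
print(arrow.aggregated)   # 0  -- aggregation ran before any task finished
```

With the normal ordering the aggregation sees all task metrics; with the Arrow ordering it runs over an empty list, matching the empty metrics observed in the Query UI.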

> toPandas with Arrow enabled doesn't show metrics in Query UI.
> -------------------------------------------------------------
>
>                 Key: SPARK-31903
>                 URL: https://issues.apache.org/jira/browse/SPARK-31903
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.5, 3.0.0
>            Reporter: Takuya Ueshin
>            Priority: Major
>         Attachments: Screen Shot 2020-06-03 at 4.47.07 PM.png, Screen Shot 2020-06-03 at 4.47.27 PM.png
>
>
> When calling {{toPandas}}, usually Query UI shows each plan node's metric and corresponding Stage ID and Task ID:
> {code:java}
> >>> df = spark.createDataFrame([(1, 10, 'abc'), (2, 20, 'def')], schema=['x', 'y', 'z'])
> >>> df.toPandas()
>    x   y    z
> 0  1  10  abc
> 1  2  20  def
> {code}
> !Screen Shot 2020-06-03 at 4.47.07 PM.png!
> but if Arrow execution is enabled, it shows only plan nodes and the duration is not correct:
> {code:java}
> >>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
> >>> df.toPandas()
>    x   y    z
> 0  1  10  abc
> 1  2  20  def
> {code}
>  
> !Screen Shot 2020-06-03 at 4.47.27 PM.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org