Posted to issues@spark.apache.org by "Vladimir Matveev (JIRA)" <ji...@apache.org> on 2019/01/24 23:59:00 UTC

[jira] [Created] (SPARK-26723) Spark web UI only shows parts of SQL query graphs for queries with persist operations

Vladimir Matveev created SPARK-26723:
----------------------------------------

             Summary: Spark web UI only shows parts of SQL query graphs for queries with persist operations
                 Key: SPARK-26723
                 URL: https://issues.apache.org/jira/browse/SPARK-26723
             Project: Spark
          Issue Type: Bug
          Components: Web UI
    Affects Versions: 2.3.2
            Reporter: Vladimir Matveev


Currently it looks like the SQL view in the Spark UI truncates the graph at the nodes corresponding to persist operations on the dataframe, showing only the part after "LocalTableScan". This is *very* inconvenient: in the common case where you have a heavy computation and want to persist it before writing to multiple outputs with some minor preprocessing, you lose almost the entire graph, along with the potentially very useful information in it.

The query plans shown below the graph, however, do contain the full query, including all computations before the persist. Unfortunately, for complex queries reading the textual plan is infeasible, and the graph visualization becomes a very helpful tool; with persist, it is apparently broken.

You can verify this in the Spark shell with a very simple example:
{code}
import org.apache.spark.sql.{functions => f}
import org.apache.spark.sql.expressions.Window

val query = Vector(1, 2, 3).toDF()
  .select(($"value".cast("long") * f.rand()).as("value"))
  .withColumn("valueAvg", f.avg($"value") over Window.orderBy("value"))
query.show()
query.persist().show()
{code}
Here the same query is executed first without persist and then with it. If you now navigate to the SQL page of the Spark web UI, you'll see two queries, but their graphs will be radically different: the one without persist contains the whole transformation with exchange, sort and window steps, while the one with persist contains only a LocalTableScan step plus some intermediate transformations needed for `show`.

After some digging into the Spark code, I think the reason is that the `org.apache.spark.sql.execution.SparkPlanInfo#fromSparkPlan` method (used to serialize a plan before emitting the SparkListenerSQLExecutionStart event) constructs the `SparkPlanInfo` object from a `SparkPlan` object incorrectly: if you invoke `toString` on the `SparkPlan`, you'll see the entire plan, but the resulting `SparkPlanInfo` object only contains the nodes corresponding to actions after `persist`. However, my knowledge of Spark internals is not deep enough to understand how to fix this, or how `SparkPlanInfo.fromSparkPlan` differs from what `SparkPlan.toString` does.
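For illustration only, here is a minimal mock of the shape of the problem I suspect (the `Plan`, `PlanInfo` and `hiddenPlan` names below are hypothetical stand-ins, not actual Spark classes): if the conversion recurses only over a node's `children`, then any node that keeps a reference to its upstream plan outside of `children` (as a cached in-memory scan plausibly does) cuts off the rest of the tree, even though a printer that also follows that reference would still show it:

```scala
object PlanInfoSketch {
  // Hypothetical plan node. `hiddenPlan` stands in for an upstream plan
  // that a node references without listing it among its `children`.
  final case class Plan(name: String,
                        children: Seq[Plan] = Nil,
                        hiddenPlan: Option[Plan] = None)

  // Hypothetical serialized form, analogous in spirit to SparkPlanInfo.
  final case class PlanInfo(name: String, children: Seq[PlanInfo])

  // Recurses only over `children`, so anything reachable solely through
  // `hiddenPlan` is silently dropped from the result.
  def fromPlan(p: Plan): PlanInfo =
    PlanInfo(p.name, p.children.map(fromPlan))

  def main(args: Array[String]): Unit = {
    val upstream = Plan("Window", Seq(Plan("Sort", Seq(Plan("Exchange")))))
    val cached   = Plan("InMemoryTableScan", hiddenPlan = Some(upstream))
    val root     = Plan("Project", Seq(cached))
    // The Window/Sort/Exchange subtree does not appear in the output,
    // mirroring the truncated graph in the SQL view.
    println(fromPlan(root))
  }
}
```

If something like this is what happens in `SparkPlanInfo.fromSparkPlan`, a fix would presumably need to follow the cached relation's plan explicitly, the way the textual plan printer apparently does.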

This was observed on Spark 2.3.2, but given that the 2.4.0 code of SparkPlanInfo does not seem to have changed much since 2.3.2, I'd expect it to be reproducible there too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
