Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2021/01/05 18:12:01 UTC

[jira] [Assigned] (SPARK-34015) SparkR partition timing summary reports input time correctly

     [ https://issues.apache.org/jira/browse/SPARK-34015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-34015:
------------------------------------

    Assignee: Apache Spark

> SparkR partition timing summary reports input time correctly
> ------------------------------------------------------------
>
>                 Key: SPARK-34015
>                 URL: https://issues.apache.org/jira/browse/SPARK-34015
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 2.3.2, 3.0.1
>         Environment: Observed on CentOS-7 running spark 2.3.1 and on my mac running master
>            Reporter: Tom Howland
>            Assignee: Apache Spark
>            Priority: Major
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> When SparkR is run at log level INFO, it prints a summary of how the worker spent its time processing the partition. A logic error causes this summary to over-report the time spent reading input rows.
> In detail: in the wider context, the variable inputElap marks the beginning of reading rows, but in the part changed here it was also used as a local variable for measuring compute time. The overwrite only skews the result when a partition holds more than one group, so the error is not observable with one group per partition, which is what you get in unit tests.
> For our application, here's what a log entry looks like before these changes were applied:
> {{20/10/09 04:08:58 WARN RRunner: Times: boot = 0.013 s, init = 0.005 s, broadcast = 0.000 s, read-input = 529.471 s, compute = 492.037 s, write-output = 0.020 s, total = 1021.546 s}}
> This misleadingly indicates that we spend more time reading rows than operating on them.
> After these changes, it looks like this:
> {{20/12/15 06:43:29 WARN RRunner: Times: boot = 0.013 s, init = 0.010 s, broadcast = 0.000 s, read-input = 120.275 s, compute = 1680.161 s, write-output = 0.045 s, total = 1812.553 s}}
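
The variable reuse described above can be illustrated with a small simulation. This is a minimal sketch in Python, not the actual SparkR worker code: only the name inputElap comes from the report (spelled input_elap here), while FakeClock, summarize, and the reuse_marker flag are hypothetical. It also assumes that the summary derives read-input from the input marker, which the report implies but does not spell out.

```python
class FakeClock:
    """Deterministic stand-in for a wall clock (hypothetical helper)."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


def summarize(read_secs, group_compute_secs, reuse_marker):
    """Simulate one partition and return the reported timings.

    read_secs: simulated time spent reading the input rows.
    group_compute_secs: simulated compute time for each group.
    reuse_marker: True reproduces the reported bug (the input marker is
    reused as a per-group local); False uses a separate local variable.
    """
    clock = FakeClock()
    start = clock.now
    clock.advance(read_secs)
    input_elap = clock.now  # assumption: read-input is derived from this marker

    compute_total = 0.0
    for secs in group_compute_secs:
        if reuse_marker:
            # Bug: overwrites the outer marker on every group.
            input_elap = clock.now
            group_start = input_elap
        else:
            # Fix: keep the per-group start time in its own variable.
            group_start = clock.now
        clock.advance(secs)
        compute_total += clock.now - group_start

    return {"read_input": input_elap - start, "compute": compute_total}


if __name__ == "__main__":
    # Three groups: the buggy variant counts earlier groups' compute as input.
    print(summarize(1.0, [2.0, 3.0, 4.0], reuse_marker=True))
    print(summarize(1.0, [2.0, 3.0, 4.0], reuse_marker=False))
    # One group: both variants agree, so a single-group test cannot see the bug.
    print(summarize(1.0, [2.0], reuse_marker=True))
```

With three groups, the buggy variant reports a read-input time that includes the compute time of every group except the last, while the compute figure is unchanged. With a single group the two variants return identical numbers, which matches the observation that the unit tests do not expose the problem.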



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
