Posted to issues@spark.apache.org by "Baohe Zhang (Jira)" <ji...@apache.org> on 2021/03/23 22:37:00 UTC

[jira] [Updated] (SPARK-34845) ProcfsMetricsGetter.computeAllMetrics may return partial metrics when some of child pids metrics are missing

     [ https://issues.apache.org/jira/browse/SPARK-34845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Baohe Zhang updated SPARK-34845:
--------------------------------
    Description: 
When the procfs metrics of some child pids are unavailable, ProcfsMetricsGetter.computeAllMetrics() may return partial metrics (the sum over a subset of the child pids) instead of an all-zero result. This can be misleading, and it contradicts the intent stated in the current code comments at [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ProcfsMetricsGetter.scala#L214].

How to reproduce it?

The following unit test illustrates the problem:
{code:java}
    val p = new ProcfsMetricsGetter(getTestResourcePath("ProcfsMetrics"))
    val mockedP = spy(p)

    // proc file of pid 22764 doesn't exist, so partial metrics shouldn't be returned
    var ptree = Set(26109, 22764, 22763)
    when(mockedP.computeProcessTree).thenReturn(ptree)
    var r = mockedP.computeAllMetrics
    assert(r.jvmVmemTotal == 0)
    assert(r.jvmRSSTotal == 0)
    assert(r.pythonVmemTotal == 0)
    assert(r.pythonRSSTotal == 0)
{code}
In the current implementation, computeAllMetrics resets allMetrics to 0 when it processes pid 22764, because that pid's proc file doesn't exist. It then continues with pid 22763 and updates allMetrics with the procfs metrics of pid 22763, so the returned result is partial rather than all zeros.
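
The behavior described above can be modeled with a small self-contained sketch. This is not the actual ProcfsMetricsGetter code: the names Metrics, PartialMetricsSketch, statFiles and computeAllMetricsCurrent, as well as the metric values, are hypothetical and chosen only for illustration. A Seq is used instead of a Set to keep the iteration order explicit.
{code:scala}
// Hypothetical, simplified stand-ins; these are not the real Spark classes.
final case class Metrics(vmem: Long, rss: Long)

object PartialMetricsSketch {
  // Pretend procfs: pid 22764 has no stat file (values are made up).
  val statFiles: Map[Int, Metrics] = Map(
    26109 -> Metrics(100L, 10L),
    22763 -> Metrics(200L, 20L))

  // Models the reported behavior: a missing file resets the running total,
  // but the loop keeps going, so later pids are added on top of the reset value.
  def computeAllMetricsCurrent(pids: Seq[Int]): Metrics = {
    var all = Metrics(0L, 0L)
    for (pid <- pids) {
      statFiles.get(pid) match {
        case Some(m) => all = Metrics(all.vmem + m.vmem, all.rss + m.rss)
        case None => all = Metrics(0L, 0L) // reset, then continue with the next pid
      }
    }
    all
  }
}
{code}
In this model, PartialMetricsSketch.computeAllMetricsCurrent(Seq(26109, 22764, 22763)) yields Metrics(200, 20), i.e. only the metrics of pid 22763, rather than the all-zero result.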

A side effect of this bug is a verbose warning log when many pids' stat files are missing. Terminating early would make the warning logs more concise.

How to solve it?

The issue can be fixed by propagating the IOException up to computeAllMetrics(). That way computeAllMetrics becomes aware that at least one child pid's procfs metrics are missing and can terminate the metrics reporting.
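
A minimal sketch of this idea, under stated assumptions: it is a toy model and not the actual patch. The names ProcMetrics, EarlyTerminationSketch, statFiles and readMetrics, as well as the metric values, are hypothetical. The per-pid reader throws an IOException, and the aggregating method catches it once and returns an all-zero result.
{code:scala}
import java.io.IOException

// Hypothetical, simplified stand-ins; these are not the real Spark classes.
final case class ProcMetrics(vmem: Long, rss: Long)

object EarlyTerminationSketch {
  val Zero: ProcMetrics = ProcMetrics(0L, 0L)

  // Pretend procfs: pid 22764 has no stat file (values are made up).
  val statFiles: Map[Int, ProcMetrics] = Map(
    26109 -> ProcMetrics(100L, 10L),
    22763 -> ProcMetrics(200L, 20L))

  // Throws instead of swallowing the failure, so the caller can react to it.
  @throws[IOException]
  def readMetrics(pid: Int): ProcMetrics =
    statFiles.getOrElse(pid, throw new IOException(s"no stat file for pid $pid"))

  // Stops at the first missing pid and reports an all-zero result.
  def computeAllMetrics(pids: Seq[Int]): ProcMetrics =
    try {
      pids.foldLeft(Zero) { (acc, pid) =>
        val m = readMetrics(pid)
        ProcMetrics(acc.vmem + m.vmem, acc.rss + m.rss)
      }
    } catch {
      case _: IOException => Zero // at least one pid is missing: no partial sums
    }
}
{code}
With this structure, the first missing pid ends the loop, so later pids are never read and at most one failure needs to be logged per call.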


> ProcfsMetricsGetter.computeAllMetrics may return partial metrics when some of child pids metrics are missing
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-34845
>                 URL: https://issues.apache.org/jira/browse/SPARK-34845
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>            Reporter: Baohe Zhang
>            Priority: Major
>
> When the procfs metrics of some child pids are unavailable, ProcfsMetricsGetter.computeAllMetrics() may return partial metrics (the sum over a subset of the child pids) instead of an all-zero result. This can be misleading, and it contradicts the intent stated in the current code comments at [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ProcfsMetricsGetter.scala#L214].
> How to reproduce it?
> The following unit test illustrates the problem:
> {code:java}
>     val p = new ProcfsMetricsGetter(getTestResourcePath("ProcfsMetrics"))
>     val mockedP = spy(p)
>     // proc file of pid 22764 doesn't exist, so partial metrics shouldn't be returned
>     var ptree = Set(26109, 22764, 22763)
>     when(mockedP.computeProcessTree).thenReturn(ptree)
>     var r = mockedP.computeAllMetrics
>     assert(r.jvmVmemTotal == 0)
>     assert(r.jvmRSSTotal == 0)
>     assert(r.pythonVmemTotal == 0)
>     assert(r.pythonRSSTotal == 0)
> {code}
> In the current implementation, computeAllMetrics resets allMetrics to 0 when it processes pid 22764, because that pid's proc file doesn't exist. It then continues with pid 22763 and updates allMetrics with the procfs metrics of pid 22763, so the returned result is partial rather than all zeros.
> A side effect of this bug is a verbose warning log when many pids' stat files are missing. Terminating early would make the warning logs more concise.
> How to solve it?
> The issue can be fixed by propagating the IOException up to computeAllMetrics(). That way computeAllMetrics becomes aware that at least one child pid's procfs metrics are missing and can terminate the metrics reporting.


