You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by nickbp <gi...@git.apache.org> on 2018/06/08 23:45:02 UTC
[GitHub] spark pull request #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver me...
GitHub user nickbp opened a pull request:
https://github.com/apache/spark/pull/21516
[SPARK-24501][MESOS] Add Dispatcher and Driver metrics
Adds metrics for the Mesos Dispatcher:
- Counters: The total number of times that submissions have entered states
- Timers: The duration from submit or launch until a submission entered a given state
- Histogram: The retry counts at time of retry
Also adds metrics for the Mesos coarse-grained driver implementation. The deprecated fine-grained driver is left as-is.
The driver metrics include e.g. frequency of RPCs to/from Mesos, the internal state tracking of tasks/agents, and durations for bringing up the tasks.
## What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Please review http://spark.apache.org/contributing.html before opening a pull request.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/nickbp/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21516.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21516
----
commit 17c4c755703019ebe835404a626f2298fd9e0171
Author: Nicholas Parker <ni...@...>
Date: 2018-05-04T17:11:23Z
[SPARK-24501][MESOS] Add Dispatcher and Driver metrics
Adds metrics for the Mesos Dispatcher:
- Counters: The total number of times that submissions have entered states
- Timers: The duration from submit or launch until a submission entered a given state
- Histogram: The retry counts at time of retry
Also adds metrics for the Mesos coarse-grained driver implementation. The deprecated fine-grained driver is left as-is.
The driver metrics include e.g. frequency of RPCs to/from Mesos, the internal state tracking of tasks/agents, and durations for bringing up the tasks.
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver metrics
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21516
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93657/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver me...
Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on a diff in the pull request:
https://github.com/apache/spark/pull/21516#discussion_r194687233
--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterSchedulerSource.scala ---
@@ -17,25 +17,170 @@
package org.apache.spark.scheduler.cluster.mesos
-import com.codahale.metrics.{Gauge, MetricRegistry}
+import java.util.concurrent.TimeUnit
+import java.util.Date
+import scala.collection.mutable.HashMap
+
+import com.codahale.metrics.{Counter, Gauge, MetricRegistry, Timer}
--- End diff --
Counter is unused import.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver metrics
Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on the issue:
https://github.com/apache/spark/pull/21516
jenkins, please test this.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver metrics
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:
https://github.com/apache/spark/pull/21516
Jenkins, ok to test
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver me...
Posted by nickbp <gi...@git.apache.org>.
Github user nickbp commented on a diff in the pull request:
https://github.com/apache/spark/pull/21516#discussion_r200463853
--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala ---
@@ -122,8 +122,9 @@ private[spark] class MesosClusterScheduler(
conf: SparkConf)
extends Scheduler with MesosSchedulerUtils {
var frameworkUrl: String = _
+ private val metricsSource = new MesosClusterSchedulerSource(this)
--- End diff --
This is just moving the prior initialization from L308 slightly earlier. This is just being done to make the `sourceName` member available below (avoid duplicate `mesos_cluster` constants across different files).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver me...
Posted by nickbp <gi...@git.apache.org>.
Github user nickbp commented on a diff in the pull request:
https://github.com/apache/spark/pull/21516#discussion_r200470228
--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterSchedulerSource.scala ---
@@ -17,25 +17,170 @@
package org.apache.spark.scheduler.cluster.mesos
-import com.codahale.metrics.{Gauge, MetricRegistry}
+import java.util.concurrent.TimeUnit
+import java.util.Date
+import scala.collection.mutable.HashMap
+
+import com.codahale.metrics.{Counter, Gauge, MetricRegistry, Timer}
+import org.apache.mesos.Protos.{TaskState => MesosTaskState}
+
+import org.apache.spark.TaskState
+import org.apache.spark.deploy.mesos.MesosDriverDescription
import org.apache.spark.metrics.source.Source
private[mesos] class MesosClusterSchedulerSource(scheduler: MesosClusterScheduler)
- extends Source {
+ extends Source with MesosSchedulerUtils {
+
+ // Submission state transitions, to derive metrics from:
+ // - submit():
+ // From: NULL
+ // To: queuedDrivers
+ // - offers/scheduleTasks():
+ // From: queuedDrivers and any pendingRetryDrivers scheduled for retry
+ // To: launchedDrivers if success, or
+ // finishedDrivers(fail) if exception
+ // - taskStatus/statusUpdate():
+ // From: launchedDrivers
+ // To: finishedDrivers(success) if success (or fail and not eligible to retry), or
+ // pendingRetryDrivers if failed (and eligible to retry)
+ // - pruning/retireDriver():
+ // From: finishedDrivers:
+ // To: NULL
override val sourceName: String = "mesos_cluster"
- override val metricRegistry: MetricRegistry = new MetricRegistry()
+ override val metricRegistry: MetricRegistry = new MetricRegistry
- metricRegistry.register(MetricRegistry.name("waitingDrivers"), new Gauge[Int] {
+ // PULL METRICS:
+ // These gauge metrics are periodically polled/pulled by the metrics system
+
+ metricRegistry.register(MetricRegistry.name("driver", "waiting"), new Gauge[Int] {
override def getValue: Int = scheduler.getQueuedDriversSize
})
- metricRegistry.register(MetricRegistry.name("launchedDrivers"), new Gauge[Int] {
+ metricRegistry.register(MetricRegistry.name("driver", "launched"), new Gauge[Int] {
override def getValue: Int = scheduler.getLaunchedDriversSize
})
- metricRegistry.register(MetricRegistry.name("retryDrivers"), new Gauge[Int] {
+ metricRegistry.register(MetricRegistry.name("driver", "retry"), new Gauge[Int] {
override def getValue: Int = scheduler.getPendingRetryDriversSize
})
+
+ metricRegistry.register(MetricRegistry.name("driver", "finished"), new Gauge[Int] {
+ override def getValue: Int = scheduler.getFinishedDriversSize
+ })
+
+ // PUSH METRICS:
+ // These metrics are updated directly as events occur
+
+ private val queuedCounter = metricRegistry.counter(MetricRegistry.name("driver", "waiting_count"))
+ private val launchedCounter =
+ metricRegistry.counter(MetricRegistry.name("driver", "launched_count"))
+ private val retryCounter = metricRegistry.counter(MetricRegistry.name("driver", "retry_count"))
+ private val exceptionCounter =
+ metricRegistry.counter(MetricRegistry.name("driver", "exception_count"))
+ private val finishedCounter =
+ metricRegistry.counter(MetricRegistry.name("driver", "finished_count"))
+
+ // Same as finishedCounter above, except grouped by MesosTaskState.
+ private val finishedMesosStateCounters = MesosTaskState.values
+ // Avoid registering 'finished' metrics for states that aren't considered finished:
+ .filter(state => TaskState.isFinished(mesosToTaskState(state)))
+ .map(state => (state, metricRegistry.counter(
+ MetricRegistry.name("driver", "finished_count_mesos_state", state.name.toLowerCase))))
+ .toMap
+ private val finishedMesosUnknownStateCounter =
+ metricRegistry.counter(MetricRegistry.name("driver", "finished_count_mesos_state", "UNKNOWN"))
+
+ // Duration from submission to FIRST launch.
+ // This omits retries since those would exaggerate the time since original submission.
+ private val submitToFirstLaunch =
+ metricRegistry.timer(MetricRegistry.name("driver", "submit_to_first_launch"))
+ // Duration from initial submission to an exception.
+ private val submitToException =
+ metricRegistry.timer(MetricRegistry.name("driver", "submit_to_exception"))
+
+ // Duration from (most recent) launch to a retry.
+ private val launchToRetry = metricRegistry.timer(MetricRegistry.name("driver", "launch_to_retry"))
+
+ // Duration from initial submission to finished.
+ private val submitToFinish =
+ metricRegistry.timer(MetricRegistry.name("driver", "submit_to_finish"))
+ // Duration from (most recent) launch to finished.
+ private val launchToFinish =
+ metricRegistry.timer(MetricRegistry.name("driver", "launch_to_finish"))
+
+ // Same as submitToFinish and launchToFinish above, except grouped by Spark TaskState.
+ class FinishStateTimers(state: String) {
+ val submitToFinish =
+ metricRegistry.timer(MetricRegistry.name("driver", "submit_to_finish_state", state))
+ val launchToFinish =
+ metricRegistry.timer(MetricRegistry.name("driver", "launch_to_finish_state", state))
+ }
+ private val finishSparkStateTimers = HashMap.empty[TaskState.TaskState, FinishStateTimers]
+ for (state <- TaskState.values) {
+ // Avoid registering 'finished' metrics for states that aren't considered finished:
+ if (TaskState.isFinished(state)) {
+ finishSparkStateTimers += (state -> new FinishStateTimers(state.toString.toLowerCase))
+ }
+ }
+ private val submitToFinishUnknownState = metricRegistry.timer(
+ MetricRegistry.name("driver", "submit_to_finish_state", "UNKNOWN"))
+ private val launchToFinishUnknownState = metricRegistry.timer(
+ MetricRegistry.name("driver", "launch_to_finish_state", "UNKNOWN"))
+
+ // Histogram of retry counts at retry scheduling
+ private val retryCount = metricRegistry.histogram(MetricRegistry.name("driver", "retry_counts"))
+
+ // Records when a submission initially enters the launch queue.
+ def recordQueuedDriver(): Unit = queuedCounter.inc
+
+ // Records when a submission has failed an attempt and is eligible to be retried
+ def recordRetryingDriver(state: MesosClusterSubmissionState): Unit = {
+ state.driverDescription.retryState.foreach(retryState => retryCount.update(retryState.retries))
+ recordTimeSince(state.startDate, launchToRetry)
+ retryCounter.inc
+ }
+
+ // Records when a submission is launched.
+ def recordLaunchedDriver(desc: MesosDriverDescription): Unit = {
+ if (!desc.retryState.isDefined) {
--- End diff --
Switched to `isEmpty`
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver me...
Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on a diff in the pull request:
https://github.com/apache/spark/pull/21516#discussion_r194689746
--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerSource.scala ---
@@ -0,0 +1,250 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler.cluster.mesos
+
+import java.util.concurrent.TimeUnit
+import java.util.concurrent.atomic.AtomicBoolean
+import java.util.Date
--- End diff --
Same as previously. Scala style here: java.util.Date is in wrong order relative to java.util.concurrent.atomic.AtomicBoolean
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver metrics
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21516
Can one of the admins verify this patch?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver metrics
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21516
**[Test build #93875 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93875/testReport)** for PR 21516 at commit [`50c1c1e`](https://github.com/apache/spark/commit/50c1c1e810fa27480ae7e72640cc8f67b44a60f1).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver me...
Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on a diff in the pull request:
https://github.com/apache/spark/pull/21516#discussion_r194677777
--- Diff: core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala ---
@@ -205,7 +208,7 @@ private[spark] class MetricsSystem private (
}
} catch {
case e: Exception =>
- logError("Sink class " + classPath + " cannot be instantiated")
+ logError(s"Sink class $classPath cannot be instantiated")
--- End diff --
+1 for all the log msgs. One question is should some of them be at debug level?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver me...
Posted by nickbp <gi...@git.apache.org>.
Github user nickbp commented on a diff in the pull request:
https://github.com/apache/spark/pull/21516#discussion_r200469780
--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterSchedulerSource.scala ---
@@ -17,25 +17,170 @@
package org.apache.spark.scheduler.cluster.mesos
-import com.codahale.metrics.{Gauge, MetricRegistry}
+import java.util.concurrent.TimeUnit
+import java.util.Date
+import scala.collection.mutable.HashMap
+
+import com.codahale.metrics.{Counter, Gauge, MetricRegistry, Timer}
--- End diff --
Removed
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver me...
Posted by nickbp <gi...@git.apache.org>.
Github user nickbp commented on a diff in the pull request:
https://github.com/apache/spark/pull/21516#discussion_r200470719
--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerSource.scala ---
@@ -0,0 +1,250 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler.cluster.mesos
+
+import java.util.concurrent.TimeUnit
+import java.util.concurrent.atomic.AtomicBoolean
+import java.util.Date
+
+import scala.collection.mutable.HashMap
+
+import com.codahale.metrics.{Counter, Gauge, MetricRegistry, Timer}
+import org.apache.mesos.Protos.{TaskState => MesosTaskState}
+
+import org.apache.spark.TaskState
+import org.apache.spark.deploy.mesos.MesosDriverDescription
--- End diff --
Cleaned up
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver me...
Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on a diff in the pull request:
https://github.com/apache/spark/pull/21516#discussion_r194690227
--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerSource.scala ---
@@ -0,0 +1,250 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler.cluster.mesos
+
+import java.util.concurrent.TimeUnit
+import java.util.concurrent.atomic.AtomicBoolean
+import java.util.Date
+
+import scala.collection.mutable.HashMap
+
+import com.codahale.metrics.{Counter, Gauge, MetricRegistry, Timer}
--- End diff --
Counter unused import.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver me...
Posted by nickbp <gi...@git.apache.org>.
Github user nickbp commented on a diff in the pull request:
https://github.com/apache/spark/pull/21516#discussion_r200470645
--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerSource.scala ---
@@ -0,0 +1,250 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler.cluster.mesos
+
+import java.util.concurrent.TimeUnit
+import java.util.concurrent.atomic.AtomicBoolean
+import java.util.Date
--- End diff --
Reordered
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver metrics
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:
https://github.com/apache/spark/pull/21516
@tnachen @susanxhuynh @mgummelt @skonto
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver me...
Posted by nickbp <gi...@git.apache.org>.
Github user nickbp commented on a diff in the pull request:
https://github.com/apache/spark/pull/21516#discussion_r200469746
--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterSchedulerSource.scala ---
@@ -17,25 +17,170 @@
package org.apache.spark.scheduler.cluster.mesos
-import com.codahale.metrics.{Gauge, MetricRegistry}
+import java.util.concurrent.TimeUnit
+import java.util.Date
--- End diff --
Reordered
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver me...
Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on a diff in the pull request:
https://github.com/apache/spark/pull/21516#discussion_r194687110
--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterSchedulerSource.scala ---
@@ -17,25 +17,170 @@
package org.apache.spark.scheduler.cluster.mesos
-import com.codahale.metrics.{Gauge, MetricRegistry}
+import java.util.concurrent.TimeUnit
+import java.util.Date
--- End diff --
This causes a scala style error:
java.util.Date is in wrong order relative to java.util.concurrent.TimeUnit
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver metrics
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21516
**[Test build #93875 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93875/testReport)** for PR 21516 at commit [`50c1c1e`](https://github.com/apache/spark/commit/50c1c1e810fa27480ae7e72640cc8f67b44a60f1).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver me...
Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on a diff in the pull request:
https://github.com/apache/spark/pull/21516#discussion_r194688618
--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala ---
@@ -796,6 +811,38 @@ private[spark] class MesosCoarseGrainedSchedulerBackend(
None
}
}
+
+ // Calls used for metrics polling, see MesosCoarseGrainedSchedulerSource:
+
+ def getCoresUsed(): Double = totalCoresAcquired
--- End diff --
Are there any issues with concurrency here. When
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver metrics
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21516
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93875/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver metrics
Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on the issue:
https://github.com/apache/spark/pull/21516
@nickbp Any isntructions how to run this? Could you also provide a design doc?
There are some isntructions [here](https://github.com/mesosphere/spark-build/blob/dd9d9891d87040825fd9e45195ebed7e5a47e027/monitoring/README.md) for example, but it would be good to have this in Spark on mesos, describe what you get as metrics for the both the driver and the dispatcher.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver metrics
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21516
**[Test build #93424 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93424/testReport)** for PR 21516 at commit [`50c1c1e`](https://github.com/apache/spark/commit/50c1c1e810fa27480ae7e72640cc8f67b44a60f1).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver metrics
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21516
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver metrics
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21516
Can one of the admins verify this patch?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver metrics
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21516
Can one of the admins verify this patch?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver me...
Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on a diff in the pull request:
https://github.com/apache/spark/pull/21516#discussion_r194687575
--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterSchedulerSource.scala ---
@@ -17,25 +17,170 @@
package org.apache.spark.scheduler.cluster.mesos
-import com.codahale.metrics.{Gauge, MetricRegistry}
+import java.util.concurrent.TimeUnit
+import java.util.Date
+import scala.collection.mutable.HashMap
+
+import com.codahale.metrics.{Counter, Gauge, MetricRegistry, Timer}
+import org.apache.mesos.Protos.{TaskState => MesosTaskState}
+
+import org.apache.spark.TaskState
+import org.apache.spark.deploy.mesos.MesosDriverDescription
import org.apache.spark.metrics.source.Source
private[mesos] class MesosClusterSchedulerSource(scheduler: MesosClusterScheduler)
- extends Source {
+ extends Source with MesosSchedulerUtils {
+
+ // Submission state transitions, to derive metrics from:
+ // - submit():
+ // From: NULL
+ // To: queuedDrivers
+ // - offers/scheduleTasks():
+ // From: queuedDrivers and any pendingRetryDrivers scheduled for retry
+ // To: launchedDrivers if success, or
+ // finishedDrivers(fail) if exception
+ // - taskStatus/statusUpdate():
+ // From: launchedDrivers
+ // To: finishedDrivers(success) if success (or fail and not eligible to retry), or
+ // pendingRetryDrivers if failed (and eligible to retry)
+ // - pruning/retireDriver():
+ // From: finishedDrivers:
+ // To: NULL
override val sourceName: String = "mesos_cluster"
- override val metricRegistry: MetricRegistry = new MetricRegistry()
+ override val metricRegistry: MetricRegistry = new MetricRegistry
- metricRegistry.register(MetricRegistry.name("waitingDrivers"), new Gauge[Int] {
+ // PULL METRICS:
+ // These gauge metrics are periodically polled/pulled by the metrics system
+
+ metricRegistry.register(MetricRegistry.name("driver", "waiting"), new Gauge[Int] {
override def getValue: Int = scheduler.getQueuedDriversSize
})
- metricRegistry.register(MetricRegistry.name("launchedDrivers"), new Gauge[Int] {
+ metricRegistry.register(MetricRegistry.name("driver", "launched"), new Gauge[Int] {
override def getValue: Int = scheduler.getLaunchedDriversSize
})
- metricRegistry.register(MetricRegistry.name("retryDrivers"), new Gauge[Int] {
+ metricRegistry.register(MetricRegistry.name("driver", "retry"), new Gauge[Int] {
override def getValue: Int = scheduler.getPendingRetryDriversSize
})
+
+ metricRegistry.register(MetricRegistry.name("driver", "finished"), new Gauge[Int] {
+ override def getValue: Int = scheduler.getFinishedDriversSize
+ })
+
+ // PUSH METRICS:
+ // These metrics are updated directly as events occur
+
+ private val queuedCounter = metricRegistry.counter(MetricRegistry.name("driver", "waiting_count"))
+ private val launchedCounter =
+ metricRegistry.counter(MetricRegistry.name("driver", "launched_count"))
+ private val retryCounter = metricRegistry.counter(MetricRegistry.name("driver", "retry_count"))
+ private val exceptionCounter =
+ metricRegistry.counter(MetricRegistry.name("driver", "exception_count"))
+ private val finishedCounter =
+ metricRegistry.counter(MetricRegistry.name("driver", "finished_count"))
+
+ // Same as finishedCounter above, except grouped by MesosTaskState.
+ private val finishedMesosStateCounters = MesosTaskState.values
+ // Avoid registering 'finished' metrics for states that aren't considered finished:
+ .filter(state => TaskState.isFinished(mesosToTaskState(state)))
+ .map(state => (state, metricRegistry.counter(
+ MetricRegistry.name("driver", "finished_count_mesos_state", state.name.toLowerCase))))
+ .toMap
+ private val finishedMesosUnknownStateCounter =
+ metricRegistry.counter(MetricRegistry.name("driver", "finished_count_mesos_state", "UNKNOWN"))
+
+ // Duration from submission to FIRST launch.
+ // This omits retries since those would exaggerate the time since original submission.
+ private val submitToFirstLaunch =
+ metricRegistry.timer(MetricRegistry.name("driver", "submit_to_first_launch"))
+ // Duration from initial submission to an exception.
+ private val submitToException =
+ metricRegistry.timer(MetricRegistry.name("driver", "submit_to_exception"))
+
+ // Duration from (most recent) launch to a retry.
+ private val launchToRetry = metricRegistry.timer(MetricRegistry.name("driver", "launch_to_retry"))
+
+ // Duration from initial submission to finished.
+ private val submitToFinish =
+ metricRegistry.timer(MetricRegistry.name("driver", "submit_to_finish"))
+ // Duration from (most recent) launch to finished.
+ private val launchToFinish =
+ metricRegistry.timer(MetricRegistry.name("driver", "launch_to_finish"))
+
+ // Same as submitToFinish and launchToFinish above, except grouped by Spark TaskState.
+ class FinishStateTimers(state: String) {
+ val submitToFinish =
+ metricRegistry.timer(MetricRegistry.name("driver", "submit_to_finish_state", state))
+ val launchToFinish =
+ metricRegistry.timer(MetricRegistry.name("driver", "launch_to_finish_state", state))
+ }
+ private val finishSparkStateTimers = HashMap.empty[TaskState.TaskState, FinishStateTimers]
+ for (state <- TaskState.values) {
+ // Avoid registering 'finished' metrics for states that aren't considered finished:
+ if (TaskState.isFinished(state)) {
+ finishSparkStateTimers += (state -> new FinishStateTimers(state.toString.toLowerCase))
+ }
+ }
+ private val submitToFinishUnknownState = metricRegistry.timer(
+ MetricRegistry.name("driver", "submit_to_finish_state", "UNKNOWN"))
+ private val launchToFinishUnknownState = metricRegistry.timer(
+ MetricRegistry.name("driver", "launch_to_finish_state", "UNKNOWN"))
+
+ // Histogram of retry counts at retry scheduling
+ private val retryCount = metricRegistry.histogram(MetricRegistry.name("driver", "retry_counts"))
+
+ // Records when a submission initially enters the launch queue.
+ def recordQueuedDriver(): Unit = queuedCounter.inc
+
+ // Records when a submission has failed an attempt and is eligible to be retried
+ def recordRetryingDriver(state: MesosClusterSubmissionState): Unit = {
+ state.driverDescription.retryState.foreach(retryState => retryCount.update(retryState.retries))
+ recordTimeSince(state.startDate, launchToRetry)
+ retryCounter.inc
+ }
+
+ // Records when a submission is launched.
+ def recordLaunchedDriver(desc: MesosDriverDescription): Unit = {
+ if (!desc.retryState.isDefined) {
--- End diff --
How about desc.retryState.isEmpty()?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver metrics
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21516
**[Test build #93424 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93424/testReport)** for PR 21516 at commit [`50c1c1e`](https://github.com/apache/spark/commit/50c1c1e810fa27480ae7e72640cc8f67b44a60f1).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver metrics
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21516
**[Test build #93657 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93657/testReport)** for PR 21516 at commit [`50c1c1e`](https://github.com/apache/spark/commit/50c1c1e810fa27480ae7e72640cc8f67b44a60f1).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver me...
Posted by nickbp <gi...@git.apache.org>.
Github user nickbp commented on a diff in the pull request:
https://github.com/apache/spark/pull/21516#discussion_r200463540
--- Diff: core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala ---
@@ -205,7 +208,7 @@ private[spark] class MetricsSystem private (
}
} catch {
case e: Exception =>
- logError("Sink class " + classPath + " cannot be instantiated")
+ logError(s"Sink class $classPath cannot be instantiated")
--- End diff --
Switched the added info logs to debug both here and in `MetricsConfig`.
I was originally thinking info level for these, just to make it easier to validate that metrics have been initialized. I could see someone wanting to determine, after the fact, why their metrics weren't forwarded as expected. But that said, they could always do that by relaunching with a debug log level.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver metrics
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21516
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver me...
Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on a diff in the pull request:
https://github.com/apache/spark/pull/21516#discussion_r194690191
--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerSource.scala ---
@@ -0,0 +1,250 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler.cluster.mesos
+
+import java.util.concurrent.TimeUnit
+import java.util.concurrent.atomic.AtomicBoolean
+import java.util.Date
+
+import scala.collection.mutable.HashMap
+
+import com.codahale.metrics.{Counter, Gauge, MetricRegistry, Timer}
+import org.apache.mesos.Protos.{TaskState => MesosTaskState}
+
+import org.apache.spark.TaskState
+import org.apache.spark.deploy.mesos.MesosDriverDescription
--- End diff --
Unused import.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver metrics
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21516
**[Test build #93657 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93657/testReport)** for PR 21516 at commit [`50c1c1e`](https://github.com/apache/spark/commit/50c1c1e810fa27480ae7e72640cc8f67b44a60f1).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver metrics
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21516
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93424/
Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver metrics
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:
https://github.com/apache/spark/pull/21516
Jenkins, retest this please
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver me...
Posted by skonto <gi...@git.apache.org>.
Github user skonto commented on a diff in the pull request:
https://github.com/apache/spark/pull/21516#discussion_r194679672
--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala ---
@@ -122,8 +122,9 @@ private[spark] class MesosClusterScheduler(
conf: SparkConf)
extends Scheduler with MesosSchedulerUtils {
var frameworkUrl: String = _
+ private val metricsSource = new MesosClusterSchedulerSource(this)
--- End diff --
Shouldn't this be configurable? Is it ok to start the dispatcher by default with a metric system set up?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver metrics
Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:
https://github.com/apache/spark/pull/21516
Jenkins, retest this please
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver metrics
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21516
Merged build finished. Test FAILed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver me...
Posted by nickbp <gi...@git.apache.org>.
Github user nickbp commented on a diff in the pull request:
https://github.com/apache/spark/pull/21516#discussion_r200470681
--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerSource.scala ---
@@ -0,0 +1,250 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler.cluster.mesos
+
+import java.util.concurrent.TimeUnit
+import java.util.concurrent.atomic.AtomicBoolean
+import java.util.Date
+
+import scala.collection.mutable.HashMap
+
+import com.codahale.metrics.{Counter, Gauge, MetricRegistry, Timer}
--- End diff --
Cleaned up
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #21516: [SPARK-24501][MESOS] Add Dispatcher and Driver me...
Posted by nickbp <gi...@git.apache.org>.
Github user nickbp commented on a diff in the pull request:
https://github.com/apache/spark/pull/21516#discussion_r200472180
--- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala ---
@@ -796,6 +811,38 @@ private[spark] class MesosCoarseGrainedSchedulerBackend(
None
}
}
+
+ // Calls used for metrics polling, see MesosCoarseGrainedSchedulerSource:
+
+ def getCoresUsed(): Double = totalCoresAcquired
--- End diff --
In practice there shouldn't be. In the case of a race, values could be briefly out of date but they'd be fine on the following refresh. This structure mirrors the dispatcher in `MesosClusterScheduler.scala`, see e.g. `getQueuedDriversSize`, `getLaunchedDriversSize`, and `getPendingRetryDriversSize` in there.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org