Posted to commits@celeborn.apache.org by an...@apache.org on 2023/03/17 08:45:32 UTC

[incubator-celeborn-website] branch CELEBORN-437 created (now 37f555b)

This is an automated email from the ASF dual-hosted git repository.

angerszhuuuu pushed a change to branch CELEBORN-437
in repository https://gitbox.apache.org/repos/asf/incubator-celeborn-website.git


      at 37f555b  [CELEBORN-437] Add doc for metrics

This branch includes the following new commits:

     new 37f555b  [CELEBORN-437] Add doc for metrics

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.

[incubator-celeborn-website] 01/01: [CELEBORN-437] Add doc for metrics

Posted by an...@apache.org.

angerszhuuuu pushed a commit to branch CELEBORN-437
in repository https://gitbox.apache.org/repos/asf/incubator-celeborn-website.git

commit 37f555b786f24dcff1a75b9089a5eaad9b9d27a9
Author: Angerszhuuuu <an...@gmail.com>
AuthorDate: Fri Mar 17 16:45:26 2023 +0800

    [CELEBORN-437] Add doc for metrics
---
 docs/user_guide/monitoring.md | 202 ++++++++++++++++++++++++++++++++++++++----
 1 file changed, 187 insertions(+), 15 deletions(-)

diff --git a/docs/user_guide/monitoring.md b/docs/user_guide/monitoring.md
index 087cf87..f4af6c3 100644
--- a/docs/user_guide/monitoring.md
+++ b/docs/user_guide/monitoring.md
@@ -22,6 +22,177 @@ There are two ways to monitor Celeborn cluster: prometheus metrics and REST API.
 
 # Metrics
 
+Celeborn has a configurable metrics system based on the
+[Dropwizard Metrics Library](http://metrics.dropwizard.io/4.2.0).
+This allows users to report Celeborn metrics to a variety of sinks, including HTTP, JMX, CSV
+files and a Prometheus servlet. The metrics are generated by sources embedded in the Celeborn code base.
+They provide instrumentation for specific activities and Celeborn components.
+The metrics system is configured via a configuration file that Celeborn expects to be present
+at `$CELEBORN_HOME/conf/metrics.properties`. A custom file location can be specified via the
+`celeborn.metrics.conf` [configuration property](https://celeborn.apache.org/configuration/#metrics).
+Instead of using the configuration file, a set of configuration parameters with prefix
+`celeborn.metrics.conf.` can be used.
+
+Celeborn's metrics are decoupled into two
+_instances_ corresponding to Celeborn components. The following instances are currently supported:
+
+* `master`: The Celeborn cluster master process.
+* `worker`: A Celeborn cluster worker process.
+
+Each instance can report to zero or more _sinks_. Sinks are contained in the
+`org.apache.celeborn.common.metrics.sink` package:
+
+* `CSVSink`: Exports metrics data to CSV files at regular intervals.
+* `PrometheusServlet`: Adds a servlet within the existing Celeborn REST API to serve metrics data in Prometheus format.
+* `GraphiteSink`: Sends metrics to a Graphite node.
+
+The syntax of the metrics configuration file and the parameters available for each sink are defined
+in an example configuration file,
+`$CELEBORN_HOME/conf/metrics.properties.template`.
+
+When using Celeborn configuration parameters instead of the metrics configuration file, the relevant
+parameter names are composed of the prefix `celeborn.metrics.conf.` followed by the configuration
+details, i.e. the parameters take the following form:
+`celeborn.metrics.conf.[instance|*].sink.[sink_name].[parameter_name]`.
+This example shows a list of Celeborn configuration parameters for a CSV sink:
+```
+"celeborn.metrics.conf.*.sink.csv.class"="org.apache.celeborn.common.metrics.sink.CsvSink"
+"celeborn.metrics.conf.*.sink.csv.period"="1"
+"celeborn.metrics.conf.*.sink.csv.unit"=minutes
+"celeborn.metrics.conf.*.sink.csv.directory"=/tmp/
+```
+
+Default values of the Celeborn metrics configuration are as follows:
+```
+*.sink.prometheusServlet.class=org.apache.celeborn.common.metrics.sink.PrometheusServlet
+```
+
+Additional sources can be configured using the metrics configuration file or the configuration
+parameter `celeborn.metrics.conf.[component_name].source.jvm.class=[source_name]`. At present,
+no optional source is available. As an illustration of the syntax, the following configuration
+parameter would activate a source named `ExampleSource`:
+`"celeborn.metrics.conf.*.source.jvm.class"="org.apache.celeborn.common.metrics.source.ExampleSource"`
+
+## List of available metrics providers
+
+Metrics used by Celeborn are of multiple types: gauge, counter, histogram, meter and timer;
+see the [Dropwizard library documentation for details](https://metrics.dropwizard.io/4.2.0/getting-started.html).
+The following list of components and metrics reports the name and some details about the available metrics,
+grouped per component instance and source namespace.
+The most common types of metrics used in Celeborn instrumentation are gauges and counters.
+Counters can be recognized by their `.count` suffix. Timers, meters and histograms are annotated
+in the list; the remaining list elements are metrics of type gauge.
+The large majority of metrics are active as soon as their parent component instance is configured.
+Some metrics must also be enabled via an additional configuration parameter; the details are
+reported in the list.
+
+### Component instance = Master
+These metrics are exposed by Celeborn master.
+
+- namespace=master
+  - WorkerCount
+  - LostWorkers
+  - BlacklistedWorkerCount
+  - RegisteredShuffleCount
+  - IsActiveMaster
+  - PartitionSize
+  - OfferSlotsTime
+
+- namespace=CPU
+  - JVMCPUTime
+
+- namespace=JVM
+  - This source provides information on JVM metrics using the
+    [Dropwizard/Codahale Metric Sets for JVM instrumentation](https://metrics.dropwizard.io/4.2.0/manual/jvm.html)
+    and in particular the metric sets BufferPoolMetricSet, GarbageCollectorMetricSet and MemoryUsageGaugeSet.
+
+- namespace=rpc
+  - RPCHeartbeatFromApplicationNum
+  - RPCHeartbeatFromWorkerNum
+  - RPCRegisterWorkerNum
+  - RPCRequestSlotsNum
+  - RPCReleaseSlotsNum
+  - RPCReleaseSlotsSize
+  - RPCUnregisterShuffleNum
+  - RPCGetBlacklistNum
+  - RPCReportWorkerUnavailableNum
+  - RPCReportWorkerUnavailableSize
+  - RPCCheckQuotaNum
+
+- namespace=ResourceConsumption
+  - **notes:**
+    - These metrics are generated per user, and each user is identified by a metric tag.
+  - diskFileCount
+  - diskBytesWritten
+  - hdfsFileCount
+  - hdfsBytesWritten
+
+### Component instance = Worker
+These metrics are exposed by Celeborn worker.
+
+- namespace=worker
+  - CommitFilesTime
+  - ReserveSlotsTime
+  - FlushDataTime
+  - OpenStreamTime
+  - FetchChunkTime
+  - MasterPushDataTime
+  - SlavePushDataTime
+  - PushDataFailCount
+  - PushDataHandshakeFailCount
+  - RegionStartFailCount
+  - RegionFinishFailCount
+  - MasterPushDataHandshakeTime
+  - SlavePushDataHandshakeTime
+  - MasterRegionStartTime
+  - SlaveRegionStartTime
+  - MasterRegionFinishTime
+  - SlaveRegionFinishTime
+  - TakeBufferTime
+  - TakeBufferTimeIndex
+  - RegisteredShuffleCount
+  - SlotsAllocated
+  - NettyMemory
+  - SortTime
+  - SortMemory
+  - SortingFiles
+  - SortedFiles
+  - SortedFileSize
+  - DiskBuffer
+  - PausePushData
+  - PausePushDataAndReplicate
+  - BufferStreamReadBuffer
+  - ReadBufferDispatcherRequestsLength
+  - DeviceOSFreeCapacity(B)
+  - DeviceOSTotalCapacity(B)
+  - DeviceCelebornFreeCapacity(B)
+  - DeviceCelebornTotalCapacity(B)
+  - PotentialConsumeSpeed
+  - UserProduceSpeed
+
+- namespace=CPU
+  - JVMCPUTime
+
+- namespace=JVM
+  - This source provides information on JVM metrics using the
+    [Dropwizard/Codahale Metric Sets for JVM instrumentation](https://metrics.dropwizard.io/4.2.0/manual/jvm.html)
+    and in particular the metric sets BufferPoolMetricSet, GarbageCollectorMetricSet and MemoryUsageGaugeSet.
+
+- namespace=rpc
+  - RPCReserveSlotsNum
+  - RPCReserveSlotsSize
+  - RPCCommitFilesNum
+  - RPCCommitFilesSize
+  - RPCDestroyNum
+  - RPCDestroySize
+  - RPCPushDataNum
+  - RPCPushDataSize
+  - RPCPushMergedDataNum
+  - RPCPushMergedDataSize
+  - RPCOpenStreamNum
+  - RPCChunkFetchRequestNum
+
+
 # REST API
 
 In addition to viewing the metrics, Celeborn also support REST API. This gives developers
@@ -43,18 +214,19 @@ The configuration of `<master-prometheus-host>`, `<master-prometheus-port>`, `<w
 
 API path listed as below:
 
-| Path                       | Service        | Meaning                                                                                                                                                                              |
-|----------------------------|----------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| /conf                      | master, worker | List the conf setting of the service.                                                                                                                                                |
-| /workerInfo                | master, worker | List worker information of the service. For the master, it will list all registered workers 's information.                                                                          |
-| /lostWorkers               | master         | List all lost workers of the master.                                                                                                                                                 |
-| /blacklistedWorkers        | master         | List all  blacklisted workers of the master.                                                                                                                                         |
-| /threadDump                | master, worker | List the current thread dump of the service.                                                                                                                                         |
-| /hostnames                 | master         | List all running application's LifecycleManager's hostnames of the cluster.                                                                                                          |
-| /applications              | master         | List all running application's ids of the cluster.                                                                                                                                   |
-| /shuffles                  | master, worker | List all running shuffle keys of the service. For master, will return all running shuffle's key of the cluster, for worker, only return keys of shuffles running in that worker.     |
-| /listTopDiskUsedApps       | master, worker | List the top disk usage application ids. For master, will return the top disk usage application ids for the cluster, for worker, only return application ids running in that worker. |
-| /listPartitionLocationInfo | worker         | List all living PartitionLocation information in that worker.                                                                                                                        |
-| /unavailablePeers          | worker         | List the unavailable peers of the worker, this always means the worker connect to the peer failed.                                                                                   |
-| /isShutdown                | worker         | Show if the worker is during the process of shutdown.                                                                                                                                |
-| /isRegistered              | worker         | Show if the worker is registered to the master success.                                                                                                                              |
\ No newline at end of file
+| Path                       | Service         | Meaning                                                                                                                                                                              |
+|----------------------------|-----------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| /metrics/prometheus        | master, worker  | List service metrics data in Prometheus format.                                                                                                                                      |
+| /conf                      | master, worker  | List the conf setting of the service.                                                                                                                                                |
+| /workerInfo                | master, worker  | List worker information of the service. For the master, it will list all registered workers' information.                                                                            |
+| /lostWorkers               | master          | List all lost workers of the master.                                                                                                                                                 |
+| /blacklistedWorkers        | master          | List all blacklisted workers of the master.                                                                                                                                          |
+| /threadDump                | master, worker  | List the current thread dump of the service.                                                                                                                                         |
+| /hostnames                 | master          | List all running application's LifecycleManager's hostnames of the cluster.                                                                                                          |
+| /applications              | master          | List all running application's ids of the cluster.                                                                                                                                   |
+| /shuffles                  | master, worker  | List all running shuffle keys of the service. For master, will return all running shuffle's key of the cluster, for worker, only return keys of shuffles running in that worker.     |
+| /listTopDiskUsedApps       | master, worker  | List the top disk usage application ids. For master, will return the top disk usage application ids for the cluster, for worker, only return application ids running in that worker. |
+| /listPartitionLocationInfo | worker          | List all living PartitionLocation information in that worker.                                                                                                                        |
+| /unavailablePeers          | worker          | List the unavailable peers of the worker; this always means the worker failed to connect to the peer.                                                                                |
+| /isShutdown                | worker          | Show if the worker is in the process of shutting down.                                                                                                                               |
+| /isRegistered              | worker          | Show if the worker is registered to the master successfully.                                                                                                                         |
\ No newline at end of file
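
Reader's note on the patch above: the documented parameter form `celeborn.metrics.conf.[instance|*].sink.[sink_name].[parameter_name]` can be sketched as a small key builder. This is only an illustration of the naming scheme, not part of the commit or of Celeborn itself; the helper names `sink_key` and `sink_conf` are hypothetical, and the instance names are taken from the list in the patch.

```python
from typing import Dict

# Prefix and instance names as described in the patched monitoring.md.
PREFIX = "celeborn.metrics.conf."
INSTANCES = ("master", "worker", "*")  # "*" is the wildcard from the documented form


def sink_key(instance: str, sink_name: str, parameter_name: str) -> str:
    """Compose one key of the form
    celeborn.metrics.conf.[instance|*].sink.[sink_name].[parameter_name].
    Hypothetical helper, for illustration only."""
    if instance not in INSTANCES:
        raise ValueError(f"unknown instance: {instance!r}")
    return f"{PREFIX}{instance}.sink.{sink_name}.{parameter_name}"


def sink_conf(instance: str, sink_name: str, params: Dict[str, str]) -> Dict[str, str]:
    """Expand a dict of sink parameters into fully qualified configuration keys."""
    return {sink_key(instance, sink_name, name): value for name, value in params.items()}


if __name__ == "__main__":
    # Mirrors the period/unit/directory entries of the CSV sink example in the patch.
    conf = sink_conf("*", "csv", {"period": "1", "unit": "minutes", "directory": "/tmp/"})
    for key, value in conf.items():
        print(f"{key}={value}")
```

Building the keys from one function keeps the instance, sink name and parameter name in a fixed order, which is the part of the scheme that is easiest to get wrong when writing the entries by hand.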