Posted to issues@trafficserver.apache.org by "James Peach (JIRA)" <ji...@apache.org> on 2016/09/08 22:17:20 UTC

[jira] [Commented] (TS-4834) Expose bad disk and disk access failures

    [ https://issues.apache.org/jira/browse/TS-4834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15475184#comment-15475184 ] 

James Peach commented on TS-4834:
---------------------------------

[~gancho] I think the nomenclature should mirror the existing metrics, which typically don't include "count" in the names, especially since the count of bad disks (a gauge) does not have the same semantics as the count of disk errors (a counter).

Is "disk" the right nomenclature, or would "span" be more accurate?

How feasible is it to have separate metrics for read and write errors? Does it make sense to also have metrics for disk read and write operations, or are system tools sufficient for that?

If you have a count of bad disks, does it make sense to have other disk counts as well, e.g. good disks, offline disks, etc.?
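The distinction matters for monitoring: a gauge is read directly, while a counter is only meaningful when differenced between samples. A minimal shell sketch of the two usage patterns (it assumes the metric names from the patch as-is, and that {{traffic_ctl metric get}} prints name/value pairs the way {{metric match}} does in the description below):

{code}
# Gauge: the current value is meaningful on its own.
bad=$(./bin/traffic_ctl metric get proxy.process.cache.disk_bad_count | awk '{print $2}')
[ "$bad" -gt 0 ] && echo "ALERT: $bad disk(s) currently marked bad"

# Counter: only the change between samples is meaningful.
prev=$(./bin/traffic_ctl metric get proxy.process.cache.disk_error_count | awk '{print $2}')
sleep 60
curr=$(./bin/traffic_ctl metric get proxy.process.cache.disk_error_count | awk '{print $2}')
echo "disk errors in the last 60s: $((curr - prev))"
{code}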

> Expose bad disk and disk access failures
> ----------------------------------------
>
>                 Key: TS-4834
>                 URL: https://issues.apache.org/jira/browse/TS-4834
>             Project: Traffic Server
>          Issue Type: Improvement
>          Components: Cache, Metrics
>            Reporter: Gancho Tenev
>            Assignee: Gancho Tenev
>             Fix For: 7.0.0
>
>
> We would like to monitor low-level disk access failures and disks marked by ATS as bad.
> I have a patch that exposes that information through
> {code}
> proxy.process.cache.disk_error_count 10
> proxy.process.cache.disk_bad_count 5
> {code}
> and the following walkthrough shows how it would work.
> Start ATS with 2 disks and tail {{diags.log}}
> {code}
> $ cat etc/trafficserver/storage.config
> /dev/sdb
> /dev/sdc
> $ tail -f var/log/trafficserver/diags.log
> [Sep  8 12:18:48.149] Server {0x2b5f43db54c0} NOTE: traffic server running
> [Sep  8 12:18:48.198] Server {0x2b5f44654700} NOTE: cache enabled
> {code}
> Check related metrics and observe all 0s
> {code}
> $ ./bin/traffic_ctl metric match "proxy.process.cache*.disk.*" "proxy.process.cache.*(read|write).failure" "proxy.process.http.cache_(read|write)_errors"
> proxy.process.cache.disk_error_count 0
> proxy.process.cache.disk_bad_count 0
> proxy.process.cache.read.failure 0
> proxy.process.cache.write.failure 0
> proxy.process.cache.volume_0.read.failure 0
> proxy.process.cache.volume_0.write.failure 0
> proxy.process.http.cache_write_errors 0
> proxy.process.http.cache_read_errors 0
> {code}
> Now, using your favorite disk failure injection tool, set both disks used by this setup, {{/dev/sdb}} and {{/dev/sdc}}, to fail all reads (one possible way is sketched below). Then send 5 requests, causing 10 failed disk reads.
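> (The ticket leaves the injection tool open; purely as an illustration, on Linux the device-mapper {{error}} target can stand in for a failing disk. The {{baddisk}} name and the idea of pointing {{storage.config}} at the mapped device are assumptions of this sketch, not what was used here.)
> {code}
> # Build a device that fails every I/O; the error target covers all sectors of /dev/sdb.
> $ dmsetup create baddisk --table "0 $(blockdev --getsz /dev/sdb) error"
> # Then list /dev/mapper/baddisk in storage.config instead of /dev/sdb.
> {code}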
> {code}
> $ for i in 1 2 3 4 5; do curl -x 127.0.0.1:80 http://example.com/1 -o /dev/null -s; done
> $ tail -f var/log/trafficserver/diags.log
> [Sep  8 12:19:09.758] Server {0x2aaab4302700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:19:09.759] Server {0x2aaac0100700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:19:09.764] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdb [1/10]
> [Sep  8 12:19:09.769] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdb [2/10]
> [Sep  8 12:19:09.785] Server {0x2aaac0100700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:19:09.786] Server {0x2aaab4302700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:19:09.791] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdb [3/10]
> [Sep  8 12:19:09.796] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdb [4/10]
> [Sep  8 12:19:09.812] Server {0x2aaab4100700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:19:09.813] Server {0x2aaacc100700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:19:09.817] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdb [5/10]
> [Sep  8 12:19:09.823] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdb [6/10]
> [Sep  8 12:19:09.843] Server {0x2aaacc302700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:19:09.844] Server {0x2aaad8100700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:19:09.847] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdb [7/10]
> [Sep  8 12:19:09.854] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdb [8/10]
> [Sep  8 12:19:09.874] Server {0x2aaacc302700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:19:09.875] Server {0x2aaad8100700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:19:09.880] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdb [9/10]
> [Sep  8 12:19:09.887] Server {0x2b5f44654700} WARNING: too many errors accessing disk /dev/sdb [10/10]: declaring disk bad
> {code}
> We see 5 read failures, which triggered 10 actual failed disk reads and caused the failing disk to be marked bad.
> {code}
> $ ./bin/traffic_ctl metric match "proxy.process.cache*.disk.*" "proxy.process.cache.*(read|write).failure" "proxy.process.http.cache_(read|write)_errors"
> proxy.process.cache.disk_error_count 10
> proxy.process.cache.disk_bad_count 1
> proxy.process.cache.read.failure 5
> proxy.process.cache.write.failure 5
> proxy.process.cache.volume_0.read.failure 5
> proxy.process.cache.volume_0.write.failure 5
> proxy.process.http.cache_write_errors 0
> proxy.process.http.cache_read_errors 0
> {code}
> Now send 5 more requests, causing 10 more failed reads.
> {code}
> $ for i in 1 2 3 4 5; do curl -x 127.0.0.1:80 http://example.com/1 -o /dev/null -s; done
> $ tail -f var/log/trafficserver/diags.log
> [Sep  8 12:26:02.874] Server {0x2aaae4100700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:26:02.875] Server {0x2aaaf0302700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:26:02.876] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdc [1/10]
> [Sep  8 12:26:02.885] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdc [2/10]
> [Sep  8 12:26:02.902] Server {0x2aaaf0302700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:26:02.902] Server {0x2aaae4100700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:26:02.907] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdc [3/10]
> [Sep  8 12:26:02.914] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdc [4/10]
> [Sep  8 12:26:02.936] Server {0x2aaafc100700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:26:02.936] Server {0x2aab08100700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:26:02.940] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdc [5/10]
> [Sep  8 12:26:02.946] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdc [6/10]
> [Sep  8 12:26:02.971] Server {0x2aab08100700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:26:02.971] Server {0x2aaafc100700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:26:02.976] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdc [7/10]
> [Sep  8 12:26:02.982] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdc [8/10]
> [Sep  8 12:26:03.002] Server {0x2aaafc100700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:26:03.002] Server {0x2aab08100700} WARNING: cache disk operation failed READ -1 0
> [Sep  8 12:26:03.006] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdc [9/10]
> [Sep  8 12:26:03.012] Server {0x2b5f43db54c0} WARNING: too many errors accessing disk /dev/sdc [10/10]: declaring disk bad
> {code}
> We see 5 more read failures, which triggered 10 more actual failed disk reads and marked the second disk as bad.
> {code}
> $ ./bin/traffic_ctl metric match "proxy.process.cache*.disk.*" "proxy.process.cache.*(read|write).failure" "proxy.process.http.cache_(read|write)_errors"
> proxy.process.cache.disk_error_count 20
> proxy.process.cache.disk_bad_count 2
> proxy.process.cache.read.failure 10
> proxy.process.cache.write.failure 10
> proxy.process.cache.volume_0.read.failure 10
> proxy.process.cache.volume_0.write.failure 10
> proxy.process.http.cache_write_errors 0
> proxy.process.http.cache_read_errors 0
> {code}
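> At this point {{disk_bad_count}} has reached the number of spans in {{storage.config}}, so a monitor could compare the two to detect total cache loss. A sketch (the line count assumes {{storage.config}} contains only device entries, with no comments or blank lines):
> {code}
> $ total=$(wc -l < etc/trafficserver/storage.config)
> $ bad=$(./bin/traffic_ctl metric get proxy.process.cache.disk_bad_count | awk '{print $2}')
> $ [ "$bad" -eq "$total" ] && echo "all $total cache spans are bad"
> {code}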
> After marking both disks bad, ATS finally disabled the cache and started going to the origin for every request.
> {code}
> $ tail -f var/log/trafficserver/diags.log
> [Sep  8 12:26:03.012] Server {0x2b5f43db54c0} WARNING: All storage devices offline, cache disabled
> {code}
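> To confirm from the client side that every request now goes to the origin, the {{Via}} header ATS adds to responses can be inspected (assuming response {{Via}} insertion is enabled via {{proxy.config.http.insert_response_via_str}}; the bracketed code string in the header encodes the cache result):
> {code}
> $ curl -s -o /dev/null -D - -x 127.0.0.1:80 http://example.com/1 | grep -i '^Via'
> {code}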


