Posted to issues@trafficserver.apache.org by "Alan M. Carroll (JIRA)" <ji...@apache.org> on 2016/08/17 17:47:21 UTC
[jira] [Commented] (TS-4242) Permanent disk failures are not handled gracefully

    [ https://issues.apache.org/jira/browse/TS-4242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15425013#comment-15425013 ] 

Alan M. Carroll commented on TS-4242:
-------------------------------------

Yes, there is no provision that I know of to handle bad sectors. It is presumed this is done by the disk internals.

> Permanent disk failures are not handled gracefully
> --------------------------------------------------
>
>                 Key: TS-4242
>                 URL: https://issues.apache.org/jira/browse/TS-4242
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Cache
>            Reporter: Luca Bruno
>             Fix For: 7.0.0
>
>
> I'm simulating a disk failure of 1 sector with the following setup:
> {noformat}
> dd if=/dev/zero of=err.img bs=512 count=2097152
> losetup /dev/loop0 err.img
> dmsetup create err0 <<EOF
> 0 1024000 linear /dev/loop0 0
> 1024000 1 error
> 1024001 1073151 linear /dev/loop0 1024001
> EOF
> dmsetup mknodes err0
> {noformat}
> With the above commands, we create a 1 GiB disk and simulate an error on a single 512-byte sector at the 500 MiB offset.
> storage.config:
> {noformat}
> /dev/mapper/err0
> {noformat}
> I have a tool that randomly generates URLs, stores them, and requests them back with a certain probability, so that I both write to and read from the disk with a given expected hit ratio.
> Once I hit the 500 MiB mark, trafficserver keeps logging warnings about disk errors. I suspect this is because trafficserver keeps writing to that bad sector instead of skipping it.
> These are the errors/warnings I'm seeing in the log repeatedly:
> {noformat}
> [Feb 29 15:29:33.308] Server {0x2ac3f1cd4700} WARNING: <AIO.cc:410 (cache_op)> cache disk operation failed WRITE -1 5
> [Feb 29 15:29:33.309] Server {0x2ac3e56063c0} WARNING: <Cache.cc:2089 (handle_disk_failure)> Error accessing Disk /dev/mapper/err0 [1726/100000000]
> [Feb 29 15:29:33.320] Server {0x2ac3e56063c0} WARNING: <CacheRead.cc:1011 (openReadStartHead)> Head : Doc magic does not match for 75B41B1A2C85AE637DD6CE368BF783D0
> [Feb 29 15:29:33.323] Server {0x2ac3eb480700} WARNING: <CacheRead.cc:1011 (openReadStartHead)> Head : Doc magic does not match for 1075CEA6E2E47496BE190DBB448B0B64
> ...
> [Feb 29 15:29:33.284] Server {0x2ac3f28e0700} WARNING: <AIO.cc:410 (cache_op)> cache disk operation failed WRITE -1 5
> [Feb 29 15:29:33.287] Server {0x2ac3eb682700} WARNING: <Cache.cc:2089 (handle_disk_failure)> Error accessing Disk /dev/mapper/err0 [1725/100000000]
> [Feb 29 15:29:33.289] Server {0x2ac3eb682700} WARNING: <CacheRead.cc:1011 (openReadStartHead)> Head : Doc magic does not match for 7E3325870F5488955118359E6C4B10F4
> [Feb 29 15:29:33.289] Server {0x2ac3eb27e700} WARNING: <CacheRead.cc:1011 (openReadStartHead)> Head : Doc magic does not match for 7AE309F21ABF9B3774C67921018FCA0E
> ...
> {noformat}
> Summary: trafficserver appears to treat I/O errors as temporary rather than permanent. Is this true? If so, the only options are:
> 1. Replace the hard disk.
> 2. Use device-mapper to skip the bad sector.
> Both options mean throwing away an entire disk cache of terabytes because of a single bad sector.
> If this is what's really happening, is it feasible to skip the bad sector? If so, I could work on a patch.
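
The layout in the quoted dd and dmsetup commands can be sanity-checked with a little arithmetic. The following is only a sketch: the sector numbers are copied from the description, while the variable names are made up for illustration. It checks that the three table segments are contiguous, that together they cover the whole image, and that the error sector sits at the 500 MiB offset.

```shell
#!/bin/sh
# Values copied from the quoted dd/dmsetup commands.
SECTOR=512          # bytes per sector (bs=512)
TOTAL=2097152       # sectors in err.img (count=2097152)
ERR_START=1024000   # first sector mapped to the "error" target

# The three table segments as "start length" pairs, in sectors.
set -- 0 1024000  1024000 1  1024001 1073151

covered=0
while [ "$#" -ge 2 ]; do
    start=$1; len=$2; shift 2
    # Each segment must begin exactly where the previous one ended.
    if [ "$start" -ne "$covered" ]; then
        echo "gap or overlap at sector $start" >&2
        break
    fi
    covered=$((start + len))
done

size_mib=$((TOTAL * SECTOR / 1024 / 1024))
err_mib=$((ERR_START * SECTOR / 1024 / 1024))

echo "table covers $covered of $TOTAL sectors"
echo "device size: $size_mib MiB"
echo "error sector starts at: $err_mib MiB"
```

With the numbers from the description this reports full coverage of a 1024 MiB device with the failing sector at 500 MiB, which matches the point at which the warnings start.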

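To see how often the failing operation repeats, the warnings in the quoted log excerpt can be tallied with awk. This is a rough sketch, not part of Traffic Server: the sample lines are copied from the excerpt, and reading the last two fields of the "cache disk operation failed" message as the return value and errno is an assumption based on the message format (errno 5 is EIO on Linux).

```shell
#!/bin/sh
# Sample lines copied from the quoted log excerpt.
log=$(cat <<'EOF'
[Feb 29 15:29:33.308] Server {0x2ac3f1cd4700} WARNING: <AIO.cc:410 (cache_op)> cache disk operation failed WRITE -1 5
[Feb 29 15:29:33.309] Server {0x2ac3e56063c0} WARNING: <Cache.cc:2089 (handle_disk_failure)> Error accessing Disk /dev/mapper/err0 [1726/100000000]
[Feb 29 15:29:33.320] Server {0x2ac3e56063c0} WARNING: <CacheRead.cc:1011 (openReadStartHead)> Head : Doc magic does not match for 75B41B1A2C85AE637DD6CE368BF783D0
[Feb 29 15:29:33.284] Server {0x2ac3f28e0700} WARNING: <AIO.cc:410 (cache_op)> cache disk operation failed WRITE -1 5
EOF
)

# Count failed cache operations, grouped by operation type and trailing code.
summary=$(printf '%s\n' "$log" | awk '
    /cache disk operation failed/ {
        # Last three fields: operation, result, code (assumed to be errno).
        key = $(NF-2) " code=" $NF
        count[key]++
    }
    END { for (k in count) print count[k], k }
')

echo "$summary"
```

Against a real log, replacing the here-document with the contents of the log file gives one line per operation/code pair, which makes it easy to see whether only writes are failing.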


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)