You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@trafficserver.apache.org by "Bryan Call (JIRA)" <ji...@apache.org> on 2016/08/17 17:50:20 UTC
[jira] [Closed] (TS-4242) Permanent disk failures are not handled gracefully

     [ https://issues.apache.org/jira/browse/TS-4242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Call closed TS-4242.
--------------------------
    Resolution: Won't Fix

We discussed at the bug scrub and we don't want to deal with managing bad sector lists in traffic server.  The disk should hopefully do this and if there are failures at the application layer the disk should be replaced.

> Permanent disk failures are not handled gracefully
> --------------------------------------------------
>
>                 Key: TS-4242
>                 URL: https://issues.apache.org/jira/browse/TS-4242
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Cache
>            Reporter: Luca Bruno
>
> I'm simulating a disk failure of 1 sector with the following setup:
> {noformat}
> dd if=/dev/zero of=err.img bs=512 count=2097152
> losetup /dev/loop0 err.img
> dmsetup create err0 <<EOF
> 0 1024000 linear /dev/loop0 0
> 1024000 1 error
> 1024001 1073151 linear /dev/loop0 1024001
> EOF
> dmsetup mknodes err0
> {noformat}
> With the above command, we create a 1Gib disk, and at 500mib we simulate an error for a single 512bytes sector.
> storage.config:
> {noformat}
> /dev/mapper/err0
> {noformat}
> Now I have a tool that randomly generates urls, stores them, and requests them back with a certain probability. So that I both write and read from the disk with a certain offered/expected hit ratio.
> Once I hit the 500mib mark, trafficserver keeps spitting warnings about disk error. I fear it's because trafficserver keeps writing that bad sector instead of skipping it.
> These are the errors/warnings I'm seeing in the log repeatedly:
> {noformat}
> [Feb 29 15:29:33.308] Server {0x2ac3f1cd4700} WARNING: <AIO.cc:410 (cache_op)> cache disk operation failed WRITE -1 5
> [Feb 29 15:29:33.309] Server {0x2ac3e56063c0} WARNING: <Cache.cc:2089 (handle_disk_failure)> Error accessing Disk /dev/mapper/err0 [1726/100000000]
> [Feb 29 15:29:33.320] Server {0x2ac3e56063c0} WARNING: <CacheRead.cc:1011 (openReadStartHead)> Head : Doc magic does not match for 75B41B1A2C85AE637DD6CE368BF783D0
> [Feb 29 15:29:33.323] Server {0x2ac3eb480700} WARNING: <CacheRead.cc:1011 (openReadStartHead)> Head : Doc magic does not match for 1075CEA6E2E47496BE190DBB448B0B64
> ...
> [Feb 29 15:29:33.284] Server {0x2ac3f28e0700} WARNING: <AIO.cc:410 (cache_op)> cache disk operation failed WRITE -1 5
> [Feb 29 15:29:33.287] Server {0x2ac3eb682700} WARNING: <Cache.cc:2089 (handle_disk_failure)> Error accessing Disk /dev/mapper/err0 [1725/100000000]
> [Feb 29 15:29:33.289] Server {0x2ac3eb682700} WARNING: <CacheRead.cc:1011 (openReadStartHead)> Head : Doc magic does not match for 7E3325870F5488955118359E6C4B10F4
> [Feb 29 15:29:33.289] Server {0x2ac3eb27e700} WARNING: <CacheRead.cc:1011 (openReadStartHead)> Head : Doc magic does not match for 7AE309F21ABF9B3774C67921018FCA0E
> ...
> {noformat}
> Summary: trafficserver does not treat I/O errors as permanent, but as temporary. Is this true? This leads to either:
> 1. Replace the hard disk
> 2. Use a devicemapper to skip the bad sector.
> Both cases lead to throwing away a whole disk cache of terabytes for just a bad sector.
> If this is what's really happening, is it feasible to skip the bad sector? If so, I could work on a patch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)