Posted to issues@kudu.apache.org by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org> on 2017/01/18 19:30:26 UTC

[jira] [Created] (KUDU-1840) Tolerate disk failures on single tablet servers

Jean-Daniel Cryans created KUDU-1840:
----------------------------------------

             Summary: Tolerate disk failures on single tablet servers
                 Key: KUDU-1840
                 URL: https://issues.apache.org/jira/browse/KUDU-1840
             Project: Kudu
          Issue Type: Improvement
          Components: fs
            Reporter: Jean-Daniel Cryans


The way we store data on disk is akin to striping (RAID 0): losing one disk means that the data left on the other disks isn't recoverable.
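
To make the striping analogy concrete, here is a minimal sketch. It assumes a simplified model in which each tablet's blocks are spread round-robin across every data directory; the function names and the placement policy are hypothetical and are not Kudu's actual block placement code.

```python
from collections import defaultdict


def place_blocks(tablets, data_dirs, blocks_per_tablet):
    """Spread each tablet's blocks round-robin over all dirs (hypothetical policy)."""
    placement = defaultdict(set)  # tablet -> set of dirs holding some of its blocks
    i = 0
    for tablet in tablets:
        for _ in range(blocks_per_tablet):
            placement[tablet].add(data_dirs[i % len(data_dirs)])
            i += 1
    return placement


def tablets_hit_by_failure(placement, failed_dir):
    """Return the tablets that have at least one block on the failed dir."""
    return sorted(t for t, dirs in placement.items() if failed_dir in dirs)


# Illustrative mount points only.
dirs = ["/data/1", "/data/2", "/data/3", "/data/4"]
placement = place_blocks(["t1", "t2", "t3"], dirs, blocks_per_tablet=8)
print(tablets_hit_by_failure(placement, "/data/4"))  # ['t1', 't2', 't3']
```

Under this model a single failed directory touches every tablet, which is why the data on the surviving disks is of no use by itself.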

After replacing a bad disk, users would see something like this:

{noformat}
Jan 18, 10:20:55.693 AM  INFO  server_base.cc:179  
Could not load existing FS layout: Not found: /data/4/kudu/instance: No such file or directory (error 2)
Jan 18, 10:20:55.693 AM  INFO  server_base.cc:180  
Creating new FS layout
Jan 18, 10:20:55.693 AM  FATAL  tablet_server_main.cc:64  
Check failed: _s.ok() Bad status: Already present: Could not create new FS layout: FSManager root is not empty: /data/1/kudu-wal
{noformat}

The above shows a tablet server finding that one folder is empty and trying to create a new FS layout, then crashing because the other folders still contain data. Currently the workaround is to manually delete the data in all the remaining Kudu folders.
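
The startup sequence in the log above can be approximated with the sketch below. It is a hypothetical simplification, not the actual server code: `open_or_create_layout` and `AlreadyPresent` are made-up names, and the only detail taken from the log is that each root is expected to hold an `instance` file.

```python
import os
import tempfile


class AlreadyPresent(Exception):
    """Raised when a new layout is requested but a root already has data."""


def open_or_create_layout(roots):
    """Open the layout if every root has an instance file, else try to create one."""
    missing = [r for r in roots if not os.path.exists(os.path.join(r, "instance"))]
    if not missing:
        return "opened"
    # A single missing instance file is treated as "no layout exists"...
    for r in roots:
        # ...but creating a fresh layout requires every root to be empty.
        if os.path.isdir(r) and os.listdir(r):
            raise AlreadyPresent(f"FSManager root is not empty: {r}")
    for r in roots:
        os.makedirs(r, exist_ok=True)
        open(os.path.join(r, "instance"), "w").close()
    return "created"


with tempfile.TemporaryDirectory() as base:
    roots = [os.path.join(base, name) for name in ("wal", "data1", "data2")]
    print(open_or_create_layout(roots))            # first start creates the layout
    os.remove(os.path.join(roots[2], "instance"))  # simulate a replaced, empty disk
    try:
        open_or_create_layout(roots)
    except AlreadyPresent as e:
        print("FATAL:", e)
```

The second call fails on the first non-empty root it checks, which mirrors the log: the complaint is about the WAL folder even though the replaced disk is a different one.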

As we fix this, one thing to keep in mind is that WALs can only be stored on one disk, so even if we tolerate data disk failures it would still not help if the WALs' disk dies.
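
The WAL constraint can be expressed as a small check. This is only a sketch of one possible policy: `can_keep_serving` is a hypothetical name, the mount points are illustrative, and the rule that at least one data disk must survive is an assumption rather than anything decided in this issue.

```python
def can_keep_serving(wal_disk, data_disks, failed_disks):
    """Decide whether a server could stay up after some disks fail (assumed policy)."""
    failed = set(failed_disks)
    # WALs live on exactly one disk, so losing it cannot be tolerated.
    if wal_disk in failed:
        return False
    # Assumption: at least one healthy data disk must remain.
    return any(d not in failed for d in data_disks)


disks = ["/data/1", "/data/2", "/data/3", "/data/4"]
print(can_keep_serving("/data/1", disks, ["/data/4"]))  # True: only a data disk died
print(can_keep_serving("/data/1", disks, ["/data/1"]))  # False: the WAL disk died
```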


