Posted to commits@kudu.apache.org by gr...@apache.org on 2018/08/15 15:21:41 UTC

kudu git commit: KUDU-2538: [docs] Document how to manually recover from Cfile corruption

Repository: kudu
Updated Branches:
  refs/heads/master e795d5a82 -> 8654a2115


KUDU-2538: [docs] Document how to manually recover from Cfile corruption

Adds troubleshooting documentation showing the
steps to manually recover from Cfile corruption.

Change-Id: Ieefd472bef104921de7cab442fd49ab32c0fe81b
Reviewed-on: http://gerrit.cloudera.org:8080/11218
Reviewed-by: Attila Bukor <ab...@apache.org>
Tested-by: Attila Bukor <ab...@apache.org>
Tested-by: Kudu Jenkins


Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/8654a211
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/8654a211
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/8654a211

Branch: refs/heads/master
Commit: 8654a21159d6150c685c7311abc7c1110cda99e0
Parents: e795d5a
Author: Grant Henke <gr...@apache.org>
Authored: Tue Aug 14 16:25:09 2018 -0500
Committer: Grant Henke <gr...@apache.org>
Committed: Wed Aug 15 15:21:23 2018 +0000

----------------------------------------------------------------------
 docs/troubleshooting.adoc | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kudu/blob/8654a211/docs/troubleshooting.adoc
----------------------------------------------------------------------
diff --git a/docs/troubleshooting.adoc b/docs/troubleshooting.adoc
index 7c47466..90791d2 100644
--- a/docs/troubleshooting.adoc
+++ b/docs/troubleshooting.adoc
@@ -630,3 +630,43 @@ of the tablet server web UI.
 The Raft consensus algorithm that Kudu uses for replication requires tombstones
 for correctness in certain rare situations. They consume minimal resources and
 hold no data. They must not be deleted.
+
+[[cfile_corruption]]
+=== Corruption: checksum error on CFile block
+
+If the data on disk becomes corrupt, users will encounter warnings containing
+"Corruption: checksum error on CFile block" in the tablet server logs and
+client-side errors when trying to scan tablets with corrupt CFile blocks.
+Until link:https://issues.apache.org/jira/browse/KUDU-2469[KUDU-2469] is
+completed, fixing this corruption is a manual process.
+
+To fix the issue, users can first identify all the affected tablets by
+running a checksum scan on the affected tables or tablets using the
+`link:command_line_tools_reference.html#cluster-ksck[ksck]` tool.
+
+----
+sudo -u kudu kudu cluster ksck <master_addresses> -checksum_scan -tables=<tables>
+sudo -u kudu kudu cluster ksck <master_addresses> -checksum_scan -tablets=<tablets>
+----
+
+If there is at least one replica for each tablet that does not return a corruption
+error, you can repair the bad copies by deleting them and forcing them to be
+re-replicated from the leader using the
+`link:command_line_tools_reference.html#remote_replica-delete[remote_replica delete]` tool.
+
+----
+sudo -u kudu kudu remote_replica delete <tserver_address> <tablet_id> "Cfile Corruption"
+----
+
+If all of the replicas are corrupt, then some data loss has occurred.
+Until link:https://issues.apache.org/jira/browse/KUDU-2526[KUDU-2526] is
+completed, this can happen if the corrupt replica becomes the leader and the
+existing follower replicas are replaced.
+
+If data has been lost, you can repair the table by replacing the corrupt tablet
+with an empty one using the
+`link:command_line_tools_reference.html#tablet-unsafe_replace_tablet[unsafe_replace_tablet]` tool.
+
+----
+sudo -u kudu kudu tablet unsafe_replace_tablet <master_addresses> <tablet_id>
+----
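
Editorial sketch (not part of the commit): the decision described in the added documentation can be summarized as a small dry-run helper. It only prints the commands quoted above and never runs them. The function name `plan_recovery`, the `MASTERS` variable, and the argument layout are hypothetical, chosen for illustration. The list of tablet servers holding corrupt replicas is assumed to have been read from the `ksck` checksum output by hand.

```shell
#!/bin/sh
# Dry-run sketch: prints the recovery commands for one tablet; nothing is executed.
# MASTERS, plan_recovery, and its argument order are hypothetical. Only the kudu
# command lines themselves are taken from the documentation in this commit.
MASTERS="${MASTERS:-<master_addresses>}"

# usage: plan_recovery <tablet_id> <total_replicas> [corrupt_tserver_address ...]
plan_recovery() {
    tablet_id=$1
    total=$2
    shift 2
    if [ "$#" -eq 0 ]; then
        echo "# no corrupt replicas reported for $tablet_id"
    elif [ "$#" -lt "$total" ]; then
        # At least one replica is healthy: delete each bad copy so that it is
        # re-replicated from the leader.
        for tserver in "$@"; do
            echo "sudo -u kudu kudu remote_replica delete $tserver $tablet_id \"Cfile Corruption\""
        done
    else
        # Every replica is corrupt, so data has been lost: replace the tablet
        # with an empty one.
        echo "sudo -u kudu kudu tablet unsafe_replace_tablet $MASTERS $tablet_id"
    fi
}
```

For example, `plan_recovery <tablet_id> 3 <tserver_address>` prints a single `remote_replica delete` command, while passing three corrupt tablet servers for a three-replica tablet prints the `unsafe_replace_tablet` command. Review the printed commands before running any of them, since the last one discards the tablet's data.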