Posted to dev@couchdb.apache.org by "Alexander Shorin (JIRA)" <ji...@apache.org> on 2015/02/26 00:10:05 UTC
[jira] [Created] (COUCHDB-2626) Explain N last replication failures
Alexander Shorin created COUCHDB-2626:
-----------------------------------------
Summary: Explain N last replication failures
Key: COUCHDB-2626
URL: https://issues.apache.org/jira/browse/COUCHDB-2626
Project: CouchDB
Issue Type: Improvement
Security Level: public (Regular issues)
Components: Replication
Reporter: Alexander Shorin
This has become quite a popular question: "I ran a replication of over 9K documents, but in the end the replication reports N document write failures. How can I get the IDs of those documents and the reason they failed?" The common answer is to parse the logs for the errors. Not cool.
The idea is to include in the replication stats a list of the document IDs that failed to be stored, along with the reason why. This could look like:
{code}
{
    "history": [
        {
            "doc_write_failures": 2,
            "doc_write_failures_explained": {
                "foo": {
                    "1-abc": {
                        "error": "forbidden",
                        "reason": "bad field bar"
                    },
                    "1-cde": {
                        "error": "forbidden",
                        "reason": "bad field baz"
                    }
                }
            },
            "docs_read": 10,
            "docs_written": 10,
            "end_last_seq": 28,
            "end_time": "Sun, 11 Aug 2013 20:38:50 GMT",
            "missing_checked": 10,
            "missing_found": 10,
            "recorded_seq": 28,
            "session_id": "142a35854a08e205c47174d91b1f9628",
            "start_last_seq": 1,
            "start_time": "Sun, 11 Aug 2013 20:38:50 GMT"
        }
    ],
    "ok": true,
    "replication_id_version": 3,
    "session_id": "142a35854a08e205c47174d91b1f9628",
    "source_last_seq": 28
}
{code}
That is, just add a mapping from document IDs to a mapping of revisions to the error info.
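To illustrate how a client might consume such a response, here is a minimal Python sketch that flattens the nested mapping into one record per failed revision. The field "doc_write_failures_explained" is only the name proposed in this issue, not an existing CouchDB field, and the helper name iter_write_failures is hypothetical.

```python
from typing import Iterator, Tuple


def iter_write_failures(result: dict) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (doc_id, rev, error, reason) for every explained failure.

    Assumes the response shape proposed in this issue; history entries
    without the proposed field are simply skipped.
    """
    for session in result.get("history", []):
        explained = session.get("doc_write_failures_explained", {})
        for doc_id, revs in explained.items():
            for rev, info in revs.items():
                yield doc_id, rev, info.get("error"), info.get("reason")


# Trimmed-down version of the example response above.
sample = {
    "history": [
        {
            "doc_write_failures": 2,
            "doc_write_failures_explained": {
                "foo": {
                    "1-abc": {"error": "forbidden", "reason": "bad field bar"},
                    "1-cde": {"error": "forbidden", "reason": "bad field baz"},
                }
            },
        }
    ],
    "ok": True,
}

failures = list(iter_write_failures(sample))
for doc_id, rev, error, reason in failures:
    print(f"{doc_id} @ {rev}: {error} ({reason})")
```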
However, we shouldn't collect all the failure explanations: you may easily have thousands of failures because of a bug in a validate_doc_update function, and in that case the explanations would bloat the checkpoint documents. To avoid this, the number of stored explanations could be limited by some configurable value, like 50, so that only the last N failures are kept. This is usually enough to understand the source of the problem, fix it, and rerun the replication.
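The "keep only the last N failures" idea can be sketched with a bounded queue. This is a Python illustration of the bookkeeping only, not how the replicator would implement it; the class name FailureLog and the parameter max_explained are hypothetical stand-ins for the configurable limit.

```python
from collections import deque


class FailureLog:
    """Counts every write failure but keeps details for only the last N."""

    def __init__(self, max_explained: int = 50):
        self.total = 0
        # A deque with maxlen silently drops the oldest entry when full.
        self._recent = deque(maxlen=max_explained)

    def record(self, doc_id: str, rev: str, error: str, reason: str) -> None:
        self.total += 1
        self._recent.append((doc_id, rev, {"error": error, "reason": reason}))

    def to_stats(self) -> dict:
        """Build the two stats fields in the shape proposed above."""
        explained = {}
        for doc_id, rev, info in self._recent:
            explained.setdefault(doc_id, {})[rev] = info
        return {
            "doc_write_failures": self.total,
            "doc_write_failures_explained": explained,
        }


# Five failures with a limit of three: the count stays exact,
# but only the three most recent explanations survive.
log = FailureLog(max_explained=3)
for i in range(5):
    log.record(f"doc{i}", "1-abc", "forbidden", f"bad field {i}")
stats = log.to_stats()
```

With this approach the total count stays accurate while the size of the explanations is capped, so the checkpoint footprint no longer grows with the number of failures.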