Posted to dev@couchdb.apache.org by "Alexander Shorin (JIRA)" <ji...@apache.org> on 2015/02/26 00:10:05 UTC
[jira] [Created] (COUCHDB-2626) Explain N last replication failures

Alexander Shorin created COUCHDB-2626:
-----------------------------------------

             Summary: Explain N last replication failures
                 Key: COUCHDB-2626
                 URL: https://issues.apache.org/jira/browse/COUCHDB-2626
             Project: CouchDB
          Issue Type: Improvement
      Security Level: public (Regular issues)
          Components: Replication
            Reporter: Alexander Shorin


This has become quite a popular question: "I ran a replication of over 9K documents, but in the end the replication reported N document write failures. How can I get the IDs of those documents and the reason for each failure?" The usual answer is to parse the logs for errors. Not cool.

The idea is to include in the replication stats a list of the IDs of the documents that failed to be stored, along with the reason why. This could look like:

{code}
{
    "history": [
        {
            "doc_write_failures": 2,
            "doc_write_failures_explained": {
                "foo": {
                    "1-abc": {
                        "error": "forbidden",
                        "reason": "bad field bar"
                    },
                    "1-cde": {
                        "error": "forbidden",
                        "reason": "bad field baz"
                    }
                }
            },
            "docs_read": 10,
            "docs_written": 10,
            "end_last_seq": 28,
            "end_time": "Sun, 11 Aug 2013 20:38:50 GMT",
            "missing_checked": 10,
            "missing_found": 10,
            "recorded_seq": 28,
            "session_id": "142a35854a08e205c47174d91b1f9628",
            "start_last_seq": 1,
            "start_time": "Sun, 11 Aug 2013 20:38:50 GMT"
        }
    ],
    "ok": true,
    "replication_id_version": 3,
    "session_id": "142a35854a08e205c47174d91b1f9628",
    "source_last_seq": 28
}
{code}

I.e. just add a mapping from document IDs to a mapping of revisions to the error info.
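
For illustration, here is a rough Python sketch of how a client could walk such a mapping. It assumes the proposed "doc_write_failures_explained" field exists in a history entry exactly as in the example above; that field is only the proposal of this issue, and the helper name iter_failures is made up for the example.

```python
import json

# Trimmed-down replication response using the field proposed in this issue.
# "doc_write_failures_explained" is hypothetical: it is the proposal itself.
RESPONSE = json.loads("""
{
    "history": [
        {
            "doc_write_failures": 2,
            "doc_write_failures_explained": {
                "foo": {
                    "1-abc": {"error": "forbidden", "reason": "bad field bar"},
                    "1-cde": {"error": "forbidden", "reason": "bad field baz"}
                }
            }
        }
    ],
    "ok": true
}
""")


def iter_failures(history_entry):
    """Yield (doc_id, rev, error, reason) for every explained failure.

    Entries without the proposed field simply yield nothing.
    """
    explained = history_entry.get("doc_write_failures_explained", {})
    for doc_id, revs in explained.items():
        for rev, info in revs.items():
            yield doc_id, rev, info.get("error"), info.get("reason")


if __name__ == "__main__":
    for entry in RESPONSE["history"]:
        for doc_id, rev, error, reason in iter_failures(entry):
            print(f"{doc_id} @ {rev}: {error} ({reason})")
```

With this shape, no log parsing is needed: the failed document IDs and revisions come straight out of the replication result.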

However, we shouldn't collect all the failure explanations: you may easily have thousands of failures because of a bug in a validate_doc_update function, and in that case the checkpoint documents would have too heavy a footprint. To avoid this, the number of stored explanations could be limited by some configurable number, like 50, so that only the last N failures are kept. This is usually enough to understand the source of the problem, fix it, and rerun the replication.
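
The "keep only the last N" behaviour could be sketched roughly as below. This is Python purely for illustration, not the replicator's actual (Erlang) code; the class name FailureLog, its methods, and the default limit of 50 are assumptions taken from the numbers mentioned above. The total counter keeps growing while the explanations are bounded.

```python
from collections import deque


class FailureLog:
    """Hypothetical bounded store for write-failure explanations."""

    def __init__(self, limit=50):
        # Total number of failures seen; never truncated.
        self.total = 0
        # A deque with maxlen silently drops the oldest item when full.
        self._recent = deque(maxlen=limit)

    def record(self, doc_id, rev, error, reason):
        """Count a failure and remember its explanation."""
        self.total += 1
        self._recent.append((doc_id, rev, {"error": error, "reason": reason}))

    def to_stats(self):
        """Build the stats fragment: doc id -> revision -> error info."""
        explained = {}
        for doc_id, rev, info in self._recent:
            explained.setdefault(doc_id, {})[rev] = info
        return {
            "doc_write_failures": self.total,
            "doc_write_failures_explained": explained,
        }


if __name__ == "__main__":
    log = FailureLog(limit=2)
    log.record("foo", "1-abc", "forbidden", "bad field bar")
    log.record("foo", "1-cde", "forbidden", "bad field baz")
    log.record("bar", "1-fff", "forbidden", "bad field qux")
    print(log.to_stats())
```

Note that "doc_write_failures" can then be larger than the number of explained entries, which tells the reader that older explanations were dropped.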



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)