Posted to dev@couchdb.apache.org by "Alexander Shorin (JIRA)" <ji...@apache.org> on 2015/02/26 00:14:04 UTC
[jira] [Updated] (COUCHDB-2626) Explain N last replication failures

     [ https://issues.apache.org/jira/browse/COUCHDB-2626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Shorin updated COUCHDB-2626:
--------------------------------------
    Description: 
It has become quite a popular question: "I ran a replication of over 9K documents, but at the end the replication reports N document write failures. How can I get the IDs of those documents and the reasons why they failed?". The common answer is to parse the logs for the errors. Not cool.

The idea is to include in the replication stats a list of the document IDs that failed to be stored, along with the reason for each failure. This could look like:

{code}
{
    "history": [
        {
            "doc_write_failures": 2,
            "doc_write_failures_explained": {
                "foo": {
                    "1-abc": {
                        "error": "forbidden",
                        "reason": "bad field bar"
                    },
                    "1-cde": {
                        "error": "forbidden",
                        "reason": "bad field baz"
                    }
                }
            },
            "docs_read": 10,
            "docs_written": 10,
            "end_last_seq": 28,
            "end_time": "Sun, 11 Aug 2013 20:38:50 GMT",
            "missing_checked": 10,
            "missing_found": 10,
            "recorded_seq": 28,
            "session_id": "142a35854a08e205c47174d91b1f9628",
            "start_last_seq": 1,
            "start_time": "Sun, 11 Aug 2013 20:38:50 GMT"
        }
    ],
    "ok": true,
    "replication_id_version": 3,
    "session_id": "142a35854a08e205c47174d91b1f9628",
    "source_last_seq": 28
}
{code}

I.e. just add a mapping keyed by document ID, where each value is a mapping of revisions to the error info.
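
A client could flatten this nested mapping into one row per failed revision. The following is a minimal Python sketch, assuming the proposed "doc_write_failures_explained" field is present in a history entry exactly as in the example above (it is only a proposal here, not an existing CouchDB field); iter_failures is a hypothetical helper name:

```python
def iter_failures(history_entry):
    """Yield (doc_id, rev, error, reason) for each explained write failure.

    Assumes the proposed layout: {doc_id: {rev: {"error": ..., "reason": ...}}}.
    """
    explained = history_entry.get("doc_write_failures_explained", {})
    for doc_id, revs in explained.items():
        for rev, info in revs.items():
            yield doc_id, rev, info.get("error"), info.get("reason")


# Sample history entry, trimmed to the fields used here.
entry = {
    "doc_write_failures": 2,
    "doc_write_failures_explained": {
        "foo": {
            "1-abc": {"error": "forbidden", "reason": "bad field bar"},
            "1-cde": {"error": "forbidden", "reason": "bad field baz"},
        }
    },
}

for doc_id, rev, error, reason in iter_failures(entry):
    print(f"{doc_id} {rev}: {error} ({reason})")
```

An entry without the field simply yields nothing, so the same loop would work against responses that do not carry explanations.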

However, we shouldn't collect all the failure explanations: you may easily have thousands of failures because of a bug in a validate_doc_update function, and in that case the checkpoint documents would have too heavy a footprint. To avoid this, the number of stored explanations could be limited by some configurable number, like 50, so that only the last N failures are kept. This is usually enough to understand the source of the problem, fix it and rerun the replication.
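
The "keep only the last N" behaviour can be illustrated with a bounded queue. This is a rough Python sketch of the idea only, not the replicator's implementation (which would live in Erlang); the FailureLog class, its method names and the limit parameter are all hypothetical:

```python
from collections import deque


class FailureLog:
    """Keep only the last `limit` write failures (hypothetical helper)."""

    def __init__(self, limit=50):
        # A deque with maxlen drops the oldest item once it is full.
        self._items = deque(maxlen=limit)
        # Counts every failure, retained or not, like doc_write_failures.
        self.total = 0

    def record(self, doc_id, rev, error, reason):
        self.total += 1
        self._items.append((doc_id, rev, {"error": error, "reason": reason}))

    def explained(self):
        """Build the doc id -> rev -> error info mapping from retained items."""
        result = {}
        for doc_id, rev, info in self._items:
            result.setdefault(doc_id, {})[rev] = info
        return result


log = FailureLog(limit=2)
log.record("foo", "1-abc", "forbidden", "bad field bar")
log.record("foo", "1-cde", "forbidden", "bad field baz")
log.record("bar", "1-fff", "forbidden", "bad field qux")
```

With a limit of 2, the first failure is discarded while the total count still reflects all three, so the stats stay accurate even though only the most recent explanations are stored.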

  was:
It becomes a quite popular question: "I run a replication for over 9K documents, but in the end replication says that there was N document writes failures. How can I get those documents ids and the reason why?". The common answer is to parse the logs for the errors. Not cool.

The idea is to include into replication stats list of document ids which were failed to store and the reason why. This could looks like:

{code}
{
    "history": [
        {
            "doc_write_failures": 2,
            "doc_write_failures_explained": {
               "foo": {
                   "1-abc": {
                      "error": "forbidden",
                      "reason": "bad field bar"
                    },
                   "1-cde": {
                      "error": "forbidden",
                      "reason": "bad field baz"
                    },
                }
             },
            "docs_read": 10,
            "docs_written": 10,
            "end_last_seq": 28,
            "end_time": "Sun, 11 Aug 2013 20:38:50 GMT",
            "missing_checked": 10,
            "missing_found": 10,
            "recorded_seq": 28,
            "session_id": "142a35854a08e205c47174d91b1f9628",
            "start_last_seq": 1,
            "start_time": "Sun, 11 Aug 2013 20:38:50 GMT"
        }
    ],
    "ok": true,
    "replication_id_version": 3,
    "session_id": "142a35854a08e205c47174d91b1f9628",
    "source_last_seq": 28
}
{code}

E.g. just add a mapping with document ids which is a mapping of revisions to the error info.

However, we shouldn't collect all the failure explanations - you may easily have thousands failures because of bug in validate_doc_update function and in this case checkpoint documents will cause too heavy footprint. To avoid this, number of stored explanations could be limited by some configurable number, like 50, and actually keep only these N last failures. This usually enough to understand the source of problem, fix it and rerun replication.


> Explain N last replication failures
> -----------------------------------
>
>                 Key: COUCHDB-2626
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-2626
>             Project: CouchDB
>          Issue Type: Improvement
>      Security Level: public(Regular issues) 
>          Components: Replication
>            Reporter: Alexander Shorin



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)