Posted to commits@couchdb.apache.org by Apache Wiki <wi...@apache.org> on 2012/02/25 21:09:30 UTC

[Couchdb Wiki] Update of "ReleaseNotices" by JanLehnardt

The "ReleaseNotices" page has been changed by JanLehnardt:
http://wiki.apache.org/couchdb/ReleaseNotices

Comment:
new release notices page

New page:
<<Include(EditTheWiki)>>

= Release Notices =

Sometimes, after we make a release, we might find out that something is wrong with it that is so severe that we need to tell everyone who runs that release. This page collects these notices.

<<TableOfContents(2)>>

== 1.0.0 ==

**A 1.0.0 RECOVERY TOOL IS NOW AVAILABLE**

Download the [[http://wiki.couchone.com/page/repair-tool#/|CouchDB 1.0.0 Repair Tool]] to recover data.


=== Notes on a Nasty Bug ===

Developers should use only the 1.0.1 release at this point, not 1.0.0. Read on to find out why.

On the weekend of August 7th–8th, 2010 we discovered and fixed a bug in CouchDB 1.0.0. The problem was subtle (cancelling a timer without deleting the reference to it) but the ramifications were not: there was potential data loss for users of 1.0.0. The 1.0.1 release contains a permanent fix, and [[../downloads.html|is available now on the download page]].

We are proud of how quickly the CouchDB community recovered from this bug and went the extra mile to make sure everyone's data was safe. It is clear we have a group of developers who care enough about all users' data that they aggressively pursued an "edge case" bug so no one would be caught off guard. Further, the team worked for the next week to create a repair tool to recover access to data which was affected by the bug. As a result, no users lost data permanently. Kudos!

=== The Remedy ===

For current users, these instructions will ensure your data is safe. First: **do not restart your CouchDB!** The hot fix involves changing configuration on the running server, so have your admin credentials handy. (If your CouchDB is in Admin Party mode with no admins defined, you won't need admin credentials. If you do not have admin credentials but you can restart the server, you can still prevent data loss. Read on.)

==== If you have admin credentials (or if your CouchDB is in Admin Party mode) ====

Visit the Futon admin console at http://yourserver:5984/_utils/, and click "Login" in the lower right-hand corner. Log in as an administrator, and visit the "Configuration" page linked in the sidebar: http://yourserver:5984/_utils/config.html

Now that you are on the configuration page, set `delayed_commits` (in the `couchdb` section) to `false`. You can do this by clicking on the word `true`, replacing it with `false`, and hitting enter.

The next time you write a document to each database, it will commit the header to disk, and your data will be secure. For safety, please continue with the next set of instructions.
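If you would rather not click through Futon, the same setting can in principle be changed over HTTP. The sketch below assumes the `/_config/<section>/<key>` endpoint of CouchDB 1.x, which takes the new value as a JSON string in the body of a PUT request. The function names, the base URL and the `send` callback are illustrative only and are not part of the original instructions. Authentication is left out, so add your admin credentials in whatever way your HTTP client expects.

```javascript
// Build (but do not send) the request that turns delayed commits off.
// Assumption: the server exposes the CouchDB 1.x /_config/<section>/<key> API.
function buildDisableDelayedCommits(baseUrl) {
  return {
    method: "PUT",
    // Strip trailing slashes so the path is not doubled up.
    url: baseUrl.replace(/\/+$/, "") + "/_config/couchdb/delayed_commits",
    headers: { "Content-Type": "application/json" },
    // Config values travel as JSON strings, so the body is "false" in quotes.
    body: JSON.stringify("false")
  };
}

// Hypothetical helper: hand the request to any sender, for example a thin
// wrapper around $.ajax or XMLHttpRequest.
function disableDelayedCommits(baseUrl, send) {
  return send(buildDisableDelayedCommits(baseUrl));
}

// Example with a stand-in sender that only prints the request.
disableDelayedCommits("http://localhost:5984/", function (req) {
  console.log(req.method, req.url, req.body);
  return req;
});
```

Keeping the request builder separate from the sender means you can inspect exactly what would be sent before pointing it at a live server.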

==== For everyone ====

To ensure that each database is committed, you can use the `_ensure_full_commit` command. There are a few ways to do this.

The simplest method is to right click the following link and add it to your bookmarks.

Bookmarklet: [[javascript:%%24.couch.allDbs%%28%%7Bsuccess%%3Afunction%%28dbs%%29%%7Bfunction%%20commitDbs%%28list%%29%%7Bvar%%20db%%3Dlist.pop%%28%%29%%3B%%24.ajax%%28%%7Btype%%3A%%22POST%%22%%2Curl%%3A%%22%%2F%%22%%2BencodeURIComponent%%28db%%29%%2B%%22%%2F_ensure_full_commit%%22%%2CcontentType%%3A%%22application%%2Fjson%%22%%2CdataType%%3A%%22json%%22%%2Ccomplete%%3Afunction%%28r%%29%%7B%%24%%28%%22%%23content%%22%%29.prepend%%28%%27%%3Cul%%20id%%3D%%22commit_all%%22%%3E%%3C%%2Ful%%3E%%27%%29%%3Bif%%28r.status%%3D%%3D201%%29%%7B%%24%%28%%22%%23commit_all%%22%%29.append%%28%%27%%3Cli%%3Ecommitted%%3A%%20%%27%%2Bdb%%2B%%27%%3C%%2Fli%%3E%%27%%29%%3B%%7Delse%%7B%%24%%28%%22%%23commit_all%%22%%29.append%%28%%27%%3Cli%%20style%%3D%%22color%%3Ared%%3B%%22%%3Eerror%%3A%%20%%27%%2Bdb%%2B%%27%%3C%%2Fli%%3E%%27%%29%%3B%%7Dif%%28list.length%%3E0%%29%%7BcommitDbs%%28list%%29%%3B%%7D%%7D%%7D%%29%%3B%%7DcommitDbs%%28dbs%%29%%3B%%7D%%7D%%29%%3B|Commit All Databases]]

Now visit Futon on your CouchDB instance at http://localhost:5984/_utils/, and select the bookmark. It will use the !JavaScript libraries included with Futon to ensure all your databases are fully committed.

Alternatively, here is a simple HTML file that you can upload to your CouchDB using Futon. When you visit it, it will make sure your data is all safely committed. If you prefer a shell script, skip below this file.

Save this HTML to a file on your machine called `commit_all.html`

{{{
    <!DOCTYPE html>
    <html>
      <head><title>Commit All Databases</title></head>
      <body>
        <h1>Commit All Databases</h1>
        <p>This script will trigger <tt>_ensure_full_commit</tt> on all databases.</p>
        <ul id="databases"></ul>
        <!-- jQuery and the CouchDB client library, both included with Futon -->
        <script src="/_utils/script/jquery.js"></script>
        <script src="/_utils/script/jquery.couch.js"></script>
        <script>
          // Fetch the list of databases, then ask each one for a full commit.
          $.couch.allDbs({
            success : function(dbs) {
              dbs.forEach(function(db) {
                $.ajax({
                  type: "POST", url: "/" + encodeURIComponent(db) + "/_ensure_full_commit",
                  contentType: "application/json", dataType: "json",
                  complete : function(r) {
                    // 201 is the success status; list anything else in red.
                    if (r.status == 201) {
                      $("#databases").append('<li>committed: '+db+'</li>');
                    } else {
                      $("#databases").append('<li style="color:red;">error: '+db+'</li>');
                    }
                  }
                });
              });
            }
          });
        </script>
      </body>
    </html>
}}}

Now browse to your CouchDB's Futon at http://localhost:5984/_utils/ and create a database. Now visit that database, and create a document, and save it. Now click the button labeled "Upload Attachment" and choose the `commit_all.html` file you just created, and upload it. A link to that HTML file will appear in Futon.

Now click the link in Futon for `commit_all.html`, and it will run `_ensure_full_commit` on all of your databases.

If you prefer a shell script, [[http://wiki.couchone.com/page/ensure-full_commit-sh|this will also commit all your databases]].

At this point your data is safe.

==== If you don't have admin credentials ====

**Warning:** make sure you followed the instructions in the above section "For everyone" before you do the rest of these steps. If you were able to log into CouchDB as an administrator (and complete the first section, before "For everyone") then you can skip this section.

In this step we will configure your CouchDB so that future updates will be durable.

Did you run the above HTML script? Do that now, or the next action may destroy data.

Now, find CouchDB's configuration file. It will be called `local.ini` and it is probably in a location like: `/usr/local/etc/couchdb/local.ini`

Open the file, and add the following lines to it:

{{{
    [couchdb]
    delayed_commits = false
}}}

Now, restart your CouchDB. This will be different on different operating systems. If you have your CouchDB configured as a system service, restarting the computer will do the trick, but if you don't want to do that, you can probably find the pid of CouchDB, by running `ps ax | grep couchdb`. Once you have the pid, you can kill CouchDB by running `kill <pid>`. If you are a fan of magic, you can do all that in one ninja move by running:

{{{
      # The [c] keeps grep from matching its own entry in the process list.
      kill `ps ax | grep '[c]ouchdb' | head -n1 | awk '{print $1}'`
}}}

Note: you might need to sudo.

Once CouchDB is killed, the system should bring it back up. When it boots, it will load the config for `delayed_commits = false` so updates from that point forward will be durable.

=== The Bug ===

Now that we have you fixed up, you might enjoy a look at the technicalities of what got broken in CouchDB.

A commit is what causes writes to become durably flushed to storage. It is an expensive operation. During a commit, recent writes are flushed to disk and a new database header is written. Finally, the new header is also flushed to disk. At the operating system level this involves multiple fsync() calls to ensure data has been fully written.

Delayed commits are a feature of CouchDB that allows it to achieve better write performance for some workloads while sacrificing a small amount of durability. The setting causes CouchDB to wait up to a full second before committing new data after an update. If the server crashes before the header is written then any writes since the last commit are lost. The choice of delayed commits as a default has been discussed many times and the consensus was that they should remain on for the 1.0 release.

For each open database in CouchDB there is an Erlang process referred to as the update process, the source for which is in a file called `couch_db_updater.erl`. All writes to a given database pass through the corresponding update process. This process is in charge of preparing, writing and committing batches of updates. In order to provide delayed commits, the update process sets a timer for one second in the future. When the timer expires a commit message is sent back to the updater. A reference to this timer is kept in the updater state. This reference prevents the updater from scheduling excessive commit messages when one is already pending.

In the updater code that shipped with 1.0 a delayed commit message that arrived when there were no pending writes never cleared the timer reference. As a result, the updater state erroneously indicated that there was a future commit scheduled. Once in this bad state the updater would never schedule another commit. In practice, this problem occurred when a write conflict was followed by a period of inactivity. The conflicting write triggered the delayed commit, but when the commit message arrived no new data needed to be written and the timer reference was not cleared. This scenario is thankfully unlikely to occur in a busy database.
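To make the failure mode concrete, here is a minimal model of the bookkeeping described above. It is a sketch and not the code from `couch_db_updater.erl`: the names (`makeUpdater`, `tick`, `clearRefWhenIdle`) are made up for illustration, time is advanced by hand instead of by a real one-second timer, and "always clear the reference when the commit message arrives" is our simplified stand-in for the actual fix.

```javascript
// Toy model of the updater's delayed-commit bookkeeping (not CouchDB code).
// clearRefWhenIdle = false models 1.0.0, true models the repaired behaviour.
function makeUpdater(clearRefWhenIdle) {
  var state = { pending: 0, committed: 0, timerRef: null };
  var timerArmed = false; // whether a real timer is actually running

  // Schedule a delayed commit unless the state says one is already pending.
  function schedule() {
    if (state.timerRef === null) {
      state.timerRef = {};
      timerArmed = true;
    }
  }

  // The delayed commit message arrives at the updater.
  function onCommitMessage() {
    if (state.pending > 0) {
      state.committed += state.pending;
      state.pending = 0;
      state.timerRef = null;
    } else if (clearRefWhenIdle) {
      state.timerRef = null; // repaired: forget the timer that just fired
    }
    // 1.0.0 path: nothing to write, and the stale reference stays in place
  }

  return {
    state: state,
    // A successful write adds uncommitted data and requests a commit.
    write: function () { state.pending += 1; schedule(); },
    // A conflicting write adds no data but still requests a commit.
    conflictingWrite: function () { schedule(); },
    // One second passes: an armed timer fires and delivers its message.
    tick: function () {
      if (timerArmed) { timerArmed = false; onCommitMessage(); }
    }
  };
}

// Write conflict, a quiet period, then a normal write and another second.
function scenario(clearRefWhenIdle) {
  var updater = makeUpdater(clearRefWhenIdle);
  updater.conflictingWrite();
  updater.tick();
  updater.write();
  updater.tick();
  return updater.state;
}

console.log("1.0.0 model, uncommitted writes:", scenario(false).pending);
console.log("repaired model, uncommitted writes:", scenario(true).pending);
```

In the 1.0.0 model the final write stays uncommitted no matter how long you wait, because the stale reference stops any new timer from being armed. In the repaired model the same sequence ends with nothing left to commit.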

=== Mixups and Fixes ===

One can never say exactly what led to a particular bug. In this case, there were some contributing factors.

==== Release procedure ====

In the run-up to 1.0, there was some confusion about which branch would ultimately become 1.0. Originally we'd discussed branching 1.0 from the 0.11.x line, as 0.11 was a feature freeze release, so that we could concentrate on bugs and performance for 1.0. However, as we approached 1.0's release, there was very little work in trunk that involved new features. And the few features added to trunk were really just refinements of existing functionality, to make it more user friendly, etc.

So in the final weeks before 1.0's release, we decided to cut it from trunk (as opposed to from the 0.11.x branch) as that would make for more straightforward code management in the future. It has also been our release policy since the early days of the project.

As a result, the commit that introduced the bug went into trunk while 0.11.x was still designated to become the 1.0 release, with the intention of having it prove its stability before a future 1.1 release. After we decided to cut 1.0 from trunk, this commit didn't get the review it needed to stay in the 1.0 release branch.

The fix here is that we are now crystal clear that future releases will always be cut from trunk. So if people are committing stuff that they feel is not baked enough for trunk, those commits will more likely be done in a feature branch. Staying clear about this is one way we can avoid similar issues in the future.

==== Code review ====

In the run-up to 1.0, there were mailing list messages about which commits were trivial, and which needed review. In the case of the commits that weren't trivial, the original committer was the one who said he thought they were fine. In the future, for any commits to the deepest parts of the storage engine, we will be careful to have review from multiple parties. Many eyes make bugs shallow, but for code like the core CouchDB storage engine, there aren't a lot of folks who are ready to review and understand a particular patch.

==== Testing ====

CouchDB currently has a suite of unit and integration tests, which guide development and provide the first line of documentation. We also have a few independent benchmark suites, which we can use to track performance improvements and regressions.

What we don't have is a set of correctness stress tests. In this case, a fuzzing test would have caught the error: one that applies a random set of operations to a constrained keyspace while tracking the expected database state, then restarts the server to make sure the state is as expected.
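As a rough illustration of that kind of test, here is a self-contained sketch. It does not talk to a real CouchDB: `makeStore` is a toy in-memory stand-in with a simplified version of the delayed-commit bug behind a `buggy` flag, and every name in it is hypothetical. A real test would swap the toy store for HTTP calls and an actual server restart.

```javascript
// Toy key-value store with delayed commits (a stand-in, not CouchDB code).
// With buggy = true it keeps a stale timer reference when a commit message
// arrives and there is nothing to write.
function makeStore(buggy) {
  var disk = {}, memory = {}, dirty = false, timerRef = null, armed = false;
  function schedule() {
    if (timerRef === null) { timerRef = {}; armed = true; }
  }
  return {
    put: function (key, value) { memory[key] = value; dirty = true; schedule(); },
    remove: function (key) { delete memory[key]; dirty = true; schedule(); },
    // A rejected write changes no data but still requests a delayed commit.
    conflict: function () { schedule(); },
    // Let the delayed-commit interval elapse so an armed timer can fire.
    wait: function () {
      if (!armed) { return; }
      armed = false;
      if (dirty) {
        disk = Object.assign({}, memory);
        dirty = false;
        timerRef = null;
      } else if (!buggy) {
        timerRef = null;
      }
    },
    // Restart the server: only committed data survives.
    restart: function () {
      memory = Object.assign({}, disk);
      dirty = false; timerRef = null; armed = false;
    },
    get: function (key) { return memory[key]; }
  };
}

// Small deterministic generator so a failing run can be replayed by seed.
function makeRandom(seed) {
  var s = seed >>> 0;
  return function (n) {
    s = (s * 1664525 + 1013904223) >>> 0;
    return Math.floor(s / 4294967296 * n); // use the high bits
  };
}

// Apply random operations to a tiny keyspace while tracking the expected
// state, then wait, restart, and return the keys that do not match.
function fuzz(store, seed, steps) {
  var rand = makeRandom(seed), expected = {}, keys = ["a", "b", "c"];
  for (var i = 0; i < steps; i++) {
    var key = keys[rand(keys.length)];
    switch (rand(4)) {
      case 0: store.put(key, i); expected[key] = i; break;
      case 1: store.remove(key); delete expected[key]; break;
      case 2: store.conflict(); break;
      default: store.wait();
    }
  }
  store.wait();    // give any delayed commit time to happen
  store.restart(); // then bounce the server
  return keys.filter(function (k) { return store.get(k) !== expected[k]; });
}

// Count how many of the first 100 seeds expose a mismatch.
function failingSeeds(buggy) {
  var failures = 0;
  for (var seed = 1; seed <= 100; seed++) {
    if (fuzz(makeStore(buggy), seed, 200).length > 0) { failures += 1; }
  }
  return failures;
}

console.log("repaired store, failing seeds:", failingSeeds(false));
console.log("buggy store, failing seeds:", failingSeeds(true));
```

Two details matter here. The test waits before restarting, because writes from the last second may legitimately be lost with delayed commits turned on. And the generator is seeded, so any failure can be replayed exactly.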

We could learn a lot from the [[http://www.sqlite.org/testing.html|SQLite testing methodology]]. Expect to see more stress and correctness tests in CouchDB's future.