Posted to dev@couchdb.apache.org by Damien Katz <da...@apache.org> on 2010/08/08 02:16:33 UTC

Re: Data loss

I reproduced this manually:

Create document with id "x", ensure full commit (simply wait longer than 1 sec, say 2 secs).

Attempt to create document "x" again, get conflict error.

Wait at least 2 secs to ensure the delayed commit attempt happens.

Now create document "y".

Wait at least 2 secs because the delayed commit should happen.

Restart server.

Document "y" is now missing.

The last delayed commit isn't happening. From then on, no docs updated with delayed commits will be available after a restart.
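The sequence above can be modeled outside CouchDB. The following toy Python sketch is illustrative only (the real logic lives in CouchDB's Erlang update path; every name here is invented), assuming the failure is that a commit check which finds nothing to write leaves the waiting_delayed_commit flag set:

```python
class ToyDb:
    """Toy model of the 1.0.0 delayed-commit bookkeeping.
    All names are illustrative, not the real Erlang code."""

    def __init__(self):
        self.committed = set()    # doc ids whose header hit disk
        self.uncommitted = set()  # written, but header not yet flushed
        self.waiting_delayed_commit = False
        self.timer_armed = False

    def write(self, doc_id):
        conflict = doc_id in self.committed or doc_id in self.uncommitted
        if not conflict:
            self.uncommitted.add(doc_id)
        # Any write schedules a delayed commit, unless one is
        # already believed to be pending.
        if not self.waiting_delayed_commit:
            self.waiting_delayed_commit = True
            self.timer_armed = True   # fires ~1s later
        return "conflict" if conflict else "ok"

    def delayed_commit_fires(self):
        if not self.timer_armed:
            return
        self.timer_armed = False
        if self.uncommitted:
            self.committed |= self.uncommitted
            self.uncommitted.clear()
            self.waiting_delayed_commit = False
        # BUG (modeled): when the headers already match, nothing is
        # committed but waiting_delayed_commit stays set, so later
        # writes never arm a new timer and no header is ever written.

    def restart(self):
        self.uncommitted.clear()  # unflushed updates are gone


db = ToyDb()
db.write("x"); db.delayed_commit_fires()   # "x" fully committed
db.write("x")                              # conflict; arms a timer
db.delayed_commit_fires()                  # nothing to commit; flag stuck
db.write("y")                              # flag set -> no timer armed
db.delayed_commit_fires()                  # no-op
db.restart()
print("y" in db.committed)                 # False: "y" is lost
```

Under this model, resetting the flag in the nothing-to-commit branch lets the later write of "y" arm a fresh timer and be committed.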

-Damien

On Aug 7, 2010, at 4:58 PM, Adam Kocoloski wrote:

> I believe it's a single delayed conflict write attempt and no successes in that same interval.
> 
> On Aug 7, 2010, at 7:51 PM, Damien Katz wrote:
> 
>> Looks like all that's necessary is a single delayed conflict write attempt, and then all subsequent delayed commits won't be committed; the header never gets written.
>> 
>> 1.0 loses data. This is ridiculously bad.
>> 
>> We need a test to reproduce this and fix.
>> 
>> -Damien
>> 
>> On Aug 7, 2010, at 4:35 PM, Adam Kocoloski wrote:
>> 
>>> Good sleuthing guys, and my apologies for letting this through.  Randall, your patch in COUCHDB-794 was actually fine, it was my reworking of it that caused this serious bug.
>>> 
>>> With respect to that gist 513282, I think it would be better to return Db#db{waiting_delayed_commit=nil} when the headers match instead of moving the cancel_timer() command as you did.  After all, we did perform the check here -- it was just that nothing needed to be committed.
>>> 
>>> Adam
>>> 
>>> On Aug 7, 2010, at 6:55 PM, Damien Katz wrote:
>>> 
>>>> Yes, I think it requires 2 conflicting writes in a row, because it needs to trigger the delayed_commit timer without actually having anything to commit, so the header never changes.
>>>> 
>>>> Try to reproduce this and add a test case.
>>>> 
>>>> -Damien
>>>> 
>>>> 
>>>> On Aug 7, 2010, at 3:47 PM, Randall Leeds wrote:
>>>> 
>>>>> I think you may be right, Damien.
>>>>> If ever a write happens that only contains conflicts while waiting for
>>>>> a delayed commit message we might still be cancelling the timer. Is
>>>>> this what you're thinking? This would be the fix:
>>>>> http://gist.github.com/513282
>>>>> 
>>>>> On Sat, Aug 7, 2010 at 15:42, Damien Katz <da...@apache.org> wrote:
>>>>>> I think the problem might be that 2 conflicting write attempts in a row can leave the #db.waiting_delayed_commit set while the timer has been cancelled. Once that happens, the header may never be written, as it always thinks a delayed commit will fire soon.
>>>>>> 
>>>>>> -Damien
>>>>>> 
>>>>>> 
>>>>>> On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote:
>>>>>> 
>>>>>>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds <ra...@gmail.com> wrote:
>>>>>>>> I agree completely! I immediately thought of this because I wrote that
>>>>>>>> change. I spent a while staring at it last night but still can't
>>>>>>>> imagine how it's a problem.
>>>>>>>> 
>>>>>>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <da...@apache.org> wrote:
>>>>>>>>> SVN commit r954043 looks suspicious. Digging further.
>>>>>>>>> 
>>>>>>>>> -Damien
>>>>>>>> 
>>>>>>> 
>>>>>>> I still want to stare at r954043, but it looks to me like there's at
>>>>>>> least one situation where we do not commit data correctly during
>>>>>>> compaction. This has to do with the way we now use the path to sync
>>>>>>> outside the couch_file:process. Check this diff:
>>>>>>> http://gist.github.com/513081
>>>>>> 
>>>>>> 
>>>> 
>>> 
>> 
> 


Re: Data loss

Posted by Eric Carlson <er...@ericcarlson.co.uk>.
Just tried it, and I can confirm that the new Futon test fails without
the patch but succeeds with it. Also, Damien's manual method of
reproducing the problem loses data without the patch, but everything
seems to work correctly with the patch.

-Eric

On 08/08/10 01:33, Randall Leeds wrote:
> http://github.com/tilgovi/couchdb/tree/fixlostcommits
>
> Test and fix in separate commits at the end of that branch, based off
> current trunk.
> Would appreciate verification that the test is initially broken but
> fixed by the patch.
>
> On Sat, Aug 7, 2010 at 17:16, Damien Katz <da...@apache.org> wrote:
>> I reproduced this manually:
>>
>> Create document with id "x", ensure full commit (simply wait longer than 1 sec, say 2 secs).
>>
>> Attempt to create document "x" again, get conflict error.
>>
>> Wait at least 2 secs to ensure the delayed commit attempt happens.
>>
>> Now create document "y".
>>
>> Wait at least 2 secs because the delayed commit should happen.
>>
>> Restart server.
>>
>> Document "y" is now missing.
>>
>> The last delayed commit isn't happening. From then on, no docs updated with delayed commits will be available after a restart.
>>
>> -Damien
>>
>> On Aug 7, 2010, at 4:58 PM, Adam Kocoloski wrote:
>>
>>> I believe it's a single delayed conflict write attempt and no successes in that same interval.
>>>
>>> On Aug 7, 2010, at 7:51 PM, Damien Katz wrote:
>>>
>>>> Looks like all that's necessary is a single delayed conflict write attempt, and then all subsequent delayed commits won't be committed; the header never gets written.
>>>>
>>>> 1.0 loses data. This is ridiculously bad.
>>>>
>>>> We need a test to reproduce this and fix.
>>>>
>>>> -Damien
>>>>
>>>> On Aug 7, 2010, at 4:35 PM, Adam Kocoloski wrote:
>>>>
>>>>> Good sleuthing guys, and my apologies for letting this through.  Randall, your patch in COUCHDB-794 was actually fine, it was my reworking of it that caused this serious bug.
>>>>>
>>>>> With respect to that gist 513282, I think it would be better to return Db#db{waiting_delayed_commit=nil} when the headers match instead of moving the cancel_timer() command as you did.  After all, we did perform the check here -- it was just that nothing needed to be committed.
>>>>>
>>>>> Adam
>>>>>
>>>>> On Aug 7, 2010, at 6:55 PM, Damien Katz wrote:
>>>>>
>>>>>> Yes, I think it requires 2 conflicting writes in a row, because it needs to trigger the delayed_commit timer without actually having anything to commit, so the header never changes.
>>>>>>
>>>>>> Try to reproduce this and add a test case.
>>>>>>
>>>>>> -Damien
>>>>>>
>>>>>>
>>>>>> On Aug 7, 2010, at 3:47 PM, Randall Leeds wrote:
>>>>>>
>>>>>>> I think you may be right, Damien.
>>>>>>> If ever a write happens that only contains conflicts while waiting for
>>>>>>> a delayed commit message we might still be cancelling the timer. Is
>>>>>>> this what you're thinking? This would be the fix:
>>>>>>> http://gist.github.com/513282
>>>>>>>
>>>>>>> On Sat, Aug 7, 2010 at 15:42, Damien Katz <da...@apache.org> wrote:
>>>>>>>> I think the problem might be that 2 conflicting write attempts in a row can leave the #db.waiting_delayed_commit set while the timer has been cancelled. Once that happens, the header may never be written, as it always thinks a delayed commit will fire soon.
>>>>>>>>
>>>>>>>> -Damien
>>>>>>>>
>>>>>>>>
>>>>>>>> On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote:
>>>>>>>>
>>>>>>>>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds <ra...@gmail.com> wrote:
>>>>>>>>>> I agree completely! I immediately thought of this because I wrote that
>>>>>>>>>> change. I spent a while staring at it last night but still can't
>>>>>>>>>> imagine how it's a problem.
>>>>>>>>>>
>>>>>>>>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <da...@apache.org> wrote:
>>>>>>>>>>> SVN commit r954043 looks suspicious. Digging further.
>>>>>>>>>>>
>>>>>>>>>>> -Damien
>>>>>>>>> I still want to stare at r954043, but it looks to me like there's at
>>>>>>>>> least one situation where we do not commit data correctly during
>>>>>>>>> compaction. This has to do with the way we now use the path to sync
>>>>>>>>> outside the couch_file:process. Check this diff:
>>>>>>>>> http://gist.github.com/513081
>>>>>>>>
>>

Re: Data loss

Posted by Randall Leeds <ra...@gmail.com>.
On Sat, Aug 7, 2010 at 18:01, Adam Kocoloski <ko...@apache.org> wrote:
> POSTing to /db/_ensure_full_commit will still cause a header to be written.
>
> Switching to delayed_commits = false and then writing a document will cause a header to be written for that DB.
>
> POSTing to /_ensure_full_commit for each DB and then flipping the delayed_commits to false will put a 1.0.0 server into a safe state with all data saved.

The safest, I think, would be to flip to delayed_commits=false first
and then post to /_ensure_full_commit on each DB.
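To make the ordering explicit, here is a small illustrative Python helper (the helper itself is invented; the endpoints are the _config and _ensure_full_commit HTTP APIs discussed in this thread) that only builds the ordered request sequence, leaving the actual HTTP calls to the operator:

```python
def mitigation_plan(db_names):
    """Build the ordered HTTP requests to make a 1.0.0 server safe.

    Order matters: disable delayed commits first, so no new writes
    enter the broken delayed-commit path, then force a header write
    for every database. (Helper name and shape are illustrative.)
    """
    plan = [("PUT", "/_config/couchdb/delayed_commits", '"false"')]
    for db in db_names:
        plan.append(("POST", "/%s/_ensure_full_commit" % db, None))
    return plan


for method, path, body in mitigation_plan(["db1", "db2"]):
    print(method, path)
```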

Re: Data loss

Posted by Adam Kocoloski <ko...@apache.org>.
POSTing to /db/_ensure_full_commit will still cause a header to be written.

Switching to delayed_commits = false and then writing a document will cause a header to be written for that DB.

POSTing to /_ensure_full_commit for each DB and then flipping the delayed_commits to false will put a 1.0.0 server into a safe state with all data saved.

Adam

On Aug 7, 2010, at 8:57 PM, Chris Anderson wrote:

> Will switching a running 1.0 server to delayed_commits=false cause the uncommitted headers to be written? Are there other remedies for folks with critical data in 1.0 who want to ensure they are safe?
> 
> Chris
> 
> Typed on glass.
> 
> On Aug 7, 2010, at 5:47 PM, Adam Kocoloski <ko...@apache.org> wrote:
> 
>> Committed to trunk and 1.0.x.
>> 
>> On Aug 7, 2010, at 8:33 PM, Randall Leeds wrote:
>> 
>>> http://github.com/tilgovi/couchdb/tree/fixlostcommits
>>> 
>>> Test and fix in separate commits at the end of that branch, based off
>>> current trunk.
>>> Would appreciate verification that the test is initially broken but
>>> fixed by the patch.
>>> 
>>> On Sat, Aug 7, 2010 at 17:16, Damien Katz <da...@apache.org> wrote:
>>>> I reproduced this manually:
>>>> 
>>>> Create document with id "x", ensure full commit (simply wait longer than 1 sec, say 2 secs).
>>>> 
>>>> Attempt to create document "x" again, get conflict error.
>>>> 
>>>> Wait at least 2 secs to ensure the delayed commit attempt happens.
>>>> 
>>>> Now create document "y".
>>>> 
>>>> Wait at least 2 secs because the delayed commit should happen.
>>>> 
>>>> Restart server.
>>>> 
>>>> Document "y" is now missing.
>>>> 
>>>> The last delayed commit isn't happening. From then on, no docs updated with delayed commits will be available after a restart.
>>>> 
>>>> -Damien
>>>> 
>>>> On Aug 7, 2010, at 4:58 PM, Adam Kocoloski wrote:
>>>> 
>>>>> I believe it's a single delayed conflict write attempt and no successes in that same interval.
>>>>> 
>>>>> On Aug 7, 2010, at 7:51 PM, Damien Katz wrote:
>>>>> 
>>>>>> Looks like all that's necessary is a single delayed conflict write attempt, and then all subsequent delayed commits won't be committed; the header never gets written.
>>>>>> 
>>>>>> 1.0 loses data. This is ridiculously bad.
>>>>>> 
>>>>>> We need a test to reproduce this and fix.
>>>>>> 
>>>>>> -Damien
>>>>>> 
>>>>>> On Aug 7, 2010, at 4:35 PM, Adam Kocoloski wrote:
>>>>>> 
>>>>>>> Good sleuthing guys, and my apologies for letting this through.  Randall, your patch in COUCHDB-794 was actually fine, it was my reworking of it that caused this serious bug.
>>>>>>> 
>>>>>>> With respect to that gist 513282, I think it would be better to return Db#db{waiting_delayed_commit=nil} when the headers match instead of moving the cancel_timer() command as you did.  After all, we did perform the check here -- it was just that nothing needed to be committed.
>>>>>>> 
>>>>>>> Adam
>>>>>>> 
>>>>>>> On Aug 7, 2010, at 6:55 PM, Damien Katz wrote:
>>>>>>> 
>>>>>>>> Yes, I think it requires 2 conflicting writes in a row, because it needs to trigger the delayed_commit timer without actually having anything to commit, so the header never changes.
>>>>>>>> 
>>>>>>>> Try to reproduce this and add a test case.
>>>>>>>> 
>>>>>>>> -Damien
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Aug 7, 2010, at 3:47 PM, Randall Leeds wrote:
>>>>>>>> 
>>>>>>>>> I think you may be right, Damien.
>>>>>>>>> If ever a write happens that only contains conflicts while waiting for
>>>>>>>>> a delayed commit message we might still be cancelling the timer. Is
>>>>>>>>> this what you're thinking? This would be the fix:
>>>>>>>>> http://gist.github.com/513282
>>>>>>>>> 
>>>>>>>>> On Sat, Aug 7, 2010 at 15:42, Damien Katz <da...@apache.org> wrote:
>>>>>>>>>> I think the problem might be that 2 conflicting write attempts in a row can leave the #db.waiting_delayed_commit set while the timer has been cancelled. Once that happens, the header may never be written, as it always thinks a delayed commit will fire soon.
>>>>>>>>>> 
>>>>>>>>>> -Damien
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote:
>>>>>>>>>> 
>>>>>>>>>>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds <ra...@gmail.com> wrote:
>>>>>>>>>>>> I agree completely! I immediately thought of this because I wrote that
>>>>>>>>>>>> change. I spent a while staring at it last night but still can't
>>>>>>>>>>>> imagine how it's a problem.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <da...@apache.org> wrote:
>>>>>>>>>>>>> SVN commit r954043 looks suspicious. Digging further.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -Damien
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> I still want to stare at r954043, but it looks to me like there's at
>>>>>>>>>>> least one situation where we do not commit data correctly during
>>>>>>>>>>> compaction. This has to do with the way we now use the path to sync
>>>>>>>>>>> outside the couch_file:process. Check this diff:
>>>>>>>>>>> http://gist.github.com/513081
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>> 


Re: Data loss

Posted by Chris Anderson <jc...@gmail.com>.
Will switching a running 1.0 server to delayed_commits=false cause the uncommitted headers to be written? Are there other remedies for folks with critical data in 1.0 who want to ensure they are safe?

Chris

Typed on glass.

On Aug 7, 2010, at 5:47 PM, Adam Kocoloski <ko...@apache.org> wrote:

> Committed to trunk and 1.0.x.
> 
> On Aug 7, 2010, at 8:33 PM, Randall Leeds wrote:
> 
>> http://github.com/tilgovi/couchdb/tree/fixlostcommits
>> 
>> Test and fix in separate commits at the end of that branch, based off
>> current trunk.
>> Would appreciate verification that the test is initially broken but
>> fixed by the patch.
>> 
>> On Sat, Aug 7, 2010 at 17:16, Damien Katz <da...@apache.org> wrote:
>>> I reproduced this manually:
>>> 
>>> Create document with id "x", ensure full commit (simply wait longer than 1 sec, say 2 secs).
>>> 
>>> Attempt to create document "x" again, get conflict error.
>>> 
>>> Wait at least 2 secs to ensure the delayed commit attempt happens.
>>> 
>>> Now create document "y".
>>> 
>>> Wait at least 2 secs because the delayed commit should happen.
>>> 
>>> Restart server.
>>> 
>>> Document "y" is now missing.
>>> 
>>> The last delayed commit isn't happening. From then on, no docs updated with delayed commits will be available after a restart.
>>> 
>>> -Damien
>>> 
>>> On Aug 7, 2010, at 4:58 PM, Adam Kocoloski wrote:
>>> 
>>>> I believe it's a single delayed conflict write attempt and no successes in that same interval.
>>>> 
>>>> On Aug 7, 2010, at 7:51 PM, Damien Katz wrote:
>>>> 
>>>>> Looks like all that's necessary is a single delayed conflict write attempt, and then all subsequent delayed commits won't be committed; the header never gets written.
>>>>> 
>>>>> 1.0 loses data. This is ridiculously bad.
>>>>> 
>>>>> We need a test to reproduce this and fix.
>>>>> 
>>>>> -Damien
>>>>> 
>>>>> On Aug 7, 2010, at 4:35 PM, Adam Kocoloski wrote:
>>>>> 
>>>>>> Good sleuthing guys, and my apologies for letting this through.  Randall, your patch in COUCHDB-794 was actually fine, it was my reworking of it that caused this serious bug.
>>>>>> 
>>>>>> With respect to that gist 513282, I think it would be better to return Db#db{waiting_delayed_commit=nil} when the headers match instead of moving the cancel_timer() command as you did.  After all, we did perform the check here -- it was just that nothing needed to be committed.
>>>>>> 
>>>>>> Adam
>>>>>> 
>>>>>> On Aug 7, 2010, at 6:55 PM, Damien Katz wrote:
>>>>>> 
>>>>>>> Yes, I think it requires 2 conflicting writes in a row, because it needs to trigger the delayed_commit timer without actually having anything to commit, so the header never changes.
>>>>>>> 
>>>>>>> Try to reproduce this and add a test case.
>>>>>>> 
>>>>>>> -Damien
>>>>>>> 
>>>>>>> 
>>>>>>> On Aug 7, 2010, at 3:47 PM, Randall Leeds wrote:
>>>>>>> 
>>>>>>>> I think you may be right, Damien.
>>>>>>>> If ever a write happens that only contains conflicts while waiting for
>>>>>>>> a delayed commit message we might still be cancelling the timer. Is
>>>>>>>> this what you're thinking? This would be the fix:
>>>>>>>> http://gist.github.com/513282
>>>>>>>> 
>>>>>>>> On Sat, Aug 7, 2010 at 15:42, Damien Katz <da...@apache.org> wrote:
>>>>>>>>> I think the problem might be that 2 conflicting write attempts in a row can leave the #db.waiting_delayed_commit set while the timer has been cancelled. Once that happens, the header may never be written, as it always thinks a delayed commit will fire soon.
>>>>>>>>> 
>>>>>>>>> -Damien
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote:
>>>>>>>>> 
>>>>>>>>>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds <ra...@gmail.com> wrote:
>>>>>>>>>>> I agree completely! I immediately thought of this because I wrote that
>>>>>>>>>>> change. I spent a while staring at it last night but still can't
>>>>>>>>>>> imagine how it's a problem.
>>>>>>>>>>> 
>>>>>>>>>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <da...@apache.org> wrote:
>>>>>>>>>>>> SVN commit r954043 looks suspicious. Digging further.
>>>>>>>>>>>> 
>>>>>>>>>>>> -Damien
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> I still want to stare at r954043, but it looks to me like there's at
>>>>>>>>>> least one situation where we do not commit data correctly during
>>>>>>>>>> compaction. This has to do with the way we now use the path to sync
>>>>>>>>>> outside the couch_file:process. Check this diff:
>>>>>>>>>> http://gist.github.com/513081
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
> 

Re: Data loss

Posted by Jan Lehnardt <ja...@apache.org>.
On 8 Aug 2010, at 13:48, Noah Slater wrote:

> Do we need to abort 0.11.2 as well?
> 
> On 8 Aug 2010, at 11:45, Jan Lehnardt wrote:
> 
>> 
>> On 8 Aug 2010, at 06:35, J Chris Anderson wrote:
>> 
>>> 
>>> On Aug 7, 2010, at 8:45 PM, Dave Cottlehuber wrote:
>>> 
>>>> is this serious enough to justify pulling current 1.0.0 release
>>>> binaries to avoid further installs putting data at risk?
>>>> 
>>> 
>>> I'm not sure what Apache policy is about altering a release after the fact. It's probably up to us to decide what to do.
>> 
>> Altering a release is a no-no. The only real procedure is to release a new version and deprecate the old one, while optionally keeping it around for posterity.
>> 
>> 
>>> Probably as soon as 1.0.1 is available we should pull the 1.0.0 release off of the downloads page, etc.
>> 
>> +1.
>> 
>>> I also think we should do a post-mortem blog post announcing the issue and the remedy, as well as digging into how we can prevent this sort of thing in the future.
>>> 
>>> We should make an official announcement before the end of the weekend, with very clear steps to remedy it. (Eg: config delayed_commits to false *without restarting the server* etc)
>> 
>> I think so, too.
>> 
>> Cheers
>> Jan
>> --
>> 
>>> 
>>> 
>>>> On 8 August 2010 15:08, Randall Leeds <ra...@gmail.com> wrote:
>>>>> Yes. Adam already back ported it.
>>>>> 
>>>>> Sent from my interstellar unicorn.
>>>>> 
>>>>> On Aug 7, 2010 8:03 PM, "Noah Slater" <ns...@apache.org> wrote:
>>>>> 
>>>>> Time to abort the vote then?
>>>>> 
>>>>> I'd like to get this fix into 1.0.1 if possible.
>>>>> 
>>>>> 
>>>>> On 8 Aug 2010, at 02:28, Damien Katz wrote:
>>>>> 
>>>>>> Thanks.
>>>>>> 
>>>>>> Anyone up to create a repair tool for w...
>>>>> 
>>> 
>> 
> 


[NOTICE] Data loss bug (and fix)

Posted by J Chris Anderson <jc...@apache.org>.
Over the weekend of August 7th–8th, 2010, we discovered and fixed a nasty bug in CouchDB 1.0.0. There is potential data loss for users of 1.0.0 running with the default configuration of delayed_commits=true.

We've issued an in-place fix and details about the data loss bug here:

http://couchdb.apache.org/notice/1.0.1.html

The 1.0.1 release will make the fix permanent, but in the meantime, following these instructions will ensure your data is safe.

Chris


Re: Data loss

Posted by Jan Lehnardt <ja...@apache.org>.
On 8 Aug 2010, at 21:49, Noah Slater wrote:

> Done.
> 
> The public site should update within the hour.
> 
> The official distribution directory no longer has 1.0.0, but the mirrors will for another 24 hours.

Randall was so kind as to update the technical details in Chris's wiki page. I took the liberty (with help from Noah) of adding it to the site under notice/1.0.1.html, as a release notice for the upcoming 1.0.1 release. I also updated the downloads page to point to the notice. It'll be up within the hour (or two).

Thanks again all for getting this resolved so quickly. The team spirit here really makes this a fun project :)

Cheers
Jan
-- 

> 
> On 8 Aug 2010, at 20:43, Jan Lehnardt wrote:
> 
>> 
>> On 8 Aug 2010, at 21:24, Noah Slater wrote:
>> 
>>> What you are suggesting is archival of the release, which means removing it from the downloads page, the distribution directory, and the mirrors. I can do this, but I'd like to know that we have consensus first. The plan as I understood it was to archive this release at the same time as making the 1.0.1 release.
>> 
>> I'd like to follow that plan.
>> 
>> Cheers
>> Jan
>> -- 
>> 
>>> 
>>> On 8 Aug 2010, at 20:21, Robert Dionne wrote:
>>> 
>>>> I would also consider removing the download link for 1.0.0 and not depend on users patching it. It's broken.
>>>> 
>>>> I have to believe there are users who won't patch it and who won't read the red warning. There's a good probability these are the kinds of users who will also be the most upset by data loss.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Aug 8, 2010, at 3:06 PM, Jan Lehnardt wrote:
>>>> 
>>>>> 
>>>>> On 8 Aug 2010, at 18:37, J Chris Anderson wrote:
>>>>> 
>>>>>> Devs,
>>>>>> 
>>>>>> I have started a document which we will use when announcing the bug. I plan to move the document from this wiki location to the http://couchdb.apache.org site before the end of the day. Please review and edit the document before then.
>>>>>> 
>>>>>> http://wiki.couchone.com/page/post-mortem
>>>>>> 
>>>>>> I have a section called "The Bug" which needs a technical description of the error and the fix. I'm hoping Adam or Randall can write this, as they are most familiar with the issues.
>>>>>> 
>>>>>> Once it is ready, we should do our best to make sure our users get a chance to read it.
>>>>> 
>>>>> I made a few more minor adjustments (see page history when you are logged in) and have nothing more to add myself, but I'd appreciate if Adam or Randall could add a few more tech bits.
>>>>> 
>>>>> --
>>>>> 
>>>>> In the meantime, I've put up a BIG FAT WARNING on the CouchDB downloads page:  
>>>>> 
>>>>> http://couchdb.apache.org/downloads.html
>>>>> 
>>>>> I plan to update the warning with a link to the post-mortem once that is done.
>>>>> 
>>>>> --
>>>>> 
>>>>> Thanks everybody for being on top of this!
>>>>> 
>>>>> Cheers
>>>>> Jan
>>>>> -- 
>>>>> 
>>>>> 
>>>>> 
>>>>>> 
>>>>>> Thanks,
>>>>>> Chris
>>>>>> 
>>>>>> On Aug 8, 2010, at 5:16 AM, Robert Newson wrote:
>>>>>> 
>>>>>>> That was also Adam's conclusion (data loss bug confined to 1.0.0).
>>>>>>> 
>>>>>>> B.
>>>>>>> 
>>>>>>> On Sun, Aug 8, 2010 at 1:10 PM, Jan Lehnardt <ja...@apache.org> wrote:
>>>>>>>> 
>>>>>>>> On 8 Aug 2010, at 13:48, Noah Slater wrote:
>>>>>>>> 
>>>>>>>>> Do we need to abort 0.11.2 as well?
>>>>>>>> 
>>>>>>>> 0.11.x does not have this commit as far as I can see.
>>>>>>>> 
>>>>>>>> Cheers
>>>>>>>> Jan
>>>>>>>> --
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 8 Aug 2010, at 11:45, Jan Lehnardt wrote:
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 8 Aug 2010, at 06:35, J Chris Anderson wrote:
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Aug 7, 2010, at 8:45 PM, Dave Cottlehuber wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> is this serious enough to justify pulling current 1.0.0 release
>>>>>>>>>>>> binaries to avoid further installs putting data at risk?
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> I'm not sure what Apache policy is about altering a release after the fact. It's probably up to us to decide what to do.
>>>>>>>>>> 
>>>>>>>>>> Altering a release is a no-no. The only real procedure is to release a new version and deprecate the old one, while optionally keeping it around for posterity.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> Probably as soon as 1.0.1 is available we should pull the 1.0.0 release off of the downloads page, etc.
>>>>>>>>>> 
>>>>>>>>>> +1.
>>>>>>>>>> 
>>>>>>>>>>> I also think we should do a post-mortem blog post announcing the issue and the remedy, as well as digging into how we can prevent this sort of thing in the future.
>>>>>>>>>>> 
>>>>>>>>>>> We should make an official announcement before the end of the weekend, with very clear steps to remedy it. (Eg: config delayed_commits to false *without restarting the server* etc)
>>>>>>>>>> 
>>>>>>>>>> I think so, too.
>>>>>>>>>> 
>>>>>>>>>> Cheers
>>>>>>>>>> Jan
>>>>>>>>>> --
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On 8 August 2010 15:08, Randall Leeds <ra...@gmail.com> wrote:
>>>>>>>>>>>>> Yes. Adam already back ported it.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Sent from my interstellar unicorn.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Aug 7, 2010 8:03 PM, "Noah Slater" <ns...@apache.org> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Time to abort the vote then?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I'd like to get this fix into 1.0.1 if possible.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On 8 Aug 2010, at 02:28, Damien Katz wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Anyone up to create a repair tool for w...
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 


Re: Data loss

Posted by Noah Slater <ns...@apache.org>.
Done.

The public site should update within the hour.

The official distribution directory no longer has 1.0.0, but the mirrors will for another 24 hours.

On 8 Aug 2010, at 20:43, Jan Lehnardt wrote:

> 
> On 8 Aug 2010, at 21:24, Noah Slater wrote:
> 
>> What you are suggesting is archival of the release, which means removing it from the downloads page, the distribution directory, and the mirrors. I can do this, but I'd like to know that we have consensus first. The plan as I understood it was to archive this release at the same time as making the 1.0.1 release.
> 
> I'd like to follow that plan.
> 
> Cheers
> Jan
> -- 
> 
>> 
>> On 8 Aug 2010, at 20:21, Robert Dionne wrote:
>> 
>>> I would also consider removing the download link for 1.0.0 and not depend on users patching it. It's broken.
>>> 
>>> I have to believe there are users who won't patch it and who won't read the red warning. There's a good probability these are the kinds of users who will also be the most upset by data loss.
>>> 
>>> 
>>> 
>>> 
>>> On Aug 8, 2010, at 3:06 PM, Jan Lehnardt wrote:
>>> 
>>>> 
>>>> On 8 Aug 2010, at 18:37, J Chris Anderson wrote:
>>>> 
>>>>> Devs,
>>>>> 
>>>>> I have started a document which we will use when announcing the bug. I plan to move the document from this wiki location to the http://couchdb.apache.org site before the end of the day. Please review and edit the document before then.
>>>>> 
>>>>> http://wiki.couchone.com/page/post-mortem
>>>>> 
>>>>> I have a section called "The Bug" which needs a technical description of the error and the fix. I'm hoping Adam or Randall can write this, as they are most familiar with the issues.
>>>>> 
>>>>> Once it is ready, we should do our best to make sure our users get a chance to read it.
>>>> 
>>>> I made a few more minor adjustments (see page history when you are logged in) and have nothing more to add myself, but I'd appreciate if Adam or Randall could add a few more tech bits.
>>>> 
>>>> --
>>>> 
>>>> In the meantime, I've put up a BIG FAT WARNING on the CouchDB downloads page:  
>>>> 
>>>> http://couchdb.apache.org/downloads.html
>>>> 
>>>> I plan to update the warning with a link to the post-mortem once that is done.
>>>> 
>>>> --
>>>> 
>>>> Thanks everybody for being on top of this!
>>>> 
>>>> Cheers
>>>> Jan
>>>> -- 
>>>> 
>>>> 
>>>> 
>>>>> 
>>>>> Thanks,
>>>>> Chris
>>>>> 
>>>>> On Aug 8, 2010, at 5:16 AM, Robert Newson wrote:
>>>>> 
>>>>>> That was also Adam's conclusion (data loss bug confined to 1.0.0).
>>>>>> 
>>>>>> B.
>>>>>> 
>>>>>> On Sun, Aug 8, 2010 at 1:10 PM, Jan Lehnardt <ja...@apache.org> wrote:
>>>>>>> 
>>>>>>> On 8 Aug 2010, at 13:48, Noah Slater wrote:
>>>>>>> 
>>>>>>>> Do we need to abort 0.11.2 as well?
>>>>>>> 
>>>>>>> 0.11.x does not have this commit as far as I can see.
>>>>>>> 
>>>>>>> Cheers
>>>>>>> Jan
>>>>>>> --
>>>>>>> 
>>>>>>>> 
>>>>>>>> On 8 Aug 2010, at 11:45, Jan Lehnardt wrote:
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 8 Aug 2010, at 06:35, J Chris Anderson wrote:
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Aug 7, 2010, at 8:45 PM, Dave Cottlehuber wrote:
>>>>>>>>>> 
>>>>>>>>>>> is this serious enough to justify pulling current 1.0.0 release
>>>>>>>>>>> binaries to avoid further installs putting data at risk?
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> I'm not sure what Apache policy is about altering a release after the fact. It's probably up to us to decide what to do.
>>>>>>>>> 
>>>>>>>>> Altering a release is a no-no. The only real procedure is to release a new version and deprecate the old one, while optionally keeping it around for posterity.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> Probably as soon as 1.0.1 is available we should pull the 1.0.0 release off of the downloads page, etc.
>>>>>>>>> 
>>>>>>>>> +1.
>>>>>>>>> 
>>>>>>>>>> I also think we should do a post-mortem blog post announcing the issue and the remedy, as well as digging into how we can prevent this sort of thing in the future.
>>>>>>>>>> 
>>>>>>>>>> We should make an official announcement before the end of the weekend, with very clear steps to remedy it. (Eg: config delayed_commits to false *without restarting the server* etc)
>>>>>>>>> 
>>>>>>>>> I think so, too.
>>>>>>>>> 
>>>>>>>>> Cheers
>>>>>>>>> Jan
>>>>>>>>> --
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On 8 August 2010 15:08, Randall Leeds <ra...@gmail.com> wrote:
>>>>>>>>>>>> Yes. Adam already back ported it.
>>>>>>>>>>>> 
>>>>>>>>>>>> Sent from my interstellar unicorn.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Aug 7, 2010 8:03 PM, "Noah Slater" <ns...@apache.org> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Time to abort the vote then?
>>>>>>>>>>>> 
>>>>>>>>>>>> I'd like to get this fix into 1.0.1 if possible.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On 8 Aug 2010, at 02:28, Damien Katz wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Anyone up to create a repair tool for w...
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 


Re: Data loss

Posted by Jan Lehnardt <ja...@apache.org>.
On 8 Aug 2010, at 21:24, Noah Slater wrote:

> What you are suggesting is archival of the release, which means removing it from the downloads page, the distribution directory, and the mirrors. I can do this, but I'd like to know that we have consensus first. The plan as I understood it was to archive this release at the same time as making the 1.0.1 release.

I'd like to follow that plan.

Cheers
Jan
-- 

> 
> On 8 Aug 2010, at 20:21, Robert Dionne wrote:
> 
>> I would also consider removing the download link for 1.0.0 and not depend on users patching it. It's broken.
>> 
>> I have to believe there are users who won't and who won't read the red sign. There's a good probability these are the kinds of users who will also be the most upset by data loss
>> 


Re: Data loss

Posted by Noah Slater <ns...@apache.org>.
What you are suggesting is an archival of the release, which means removing it from the downloads page, the distribution directory, and the mirrors. I can do this, but I'd like to know that we have consensus first. The plan as I understood it was to archive this release at the same time as making the 1.0.1 release.

On 8 Aug 2010, at 20:21, Robert Dionne wrote:

> I would also consider removing the download link for 1.0.0 and not depend on users patching it. It's broken.
> 
> I have to believe there are users who won't and who won't read the red sign. There's a good probability these are the kinds of users who will also be the most upset by data loss


Re: Data loss

Posted by Robert Dionne <di...@dionne-associates.com>.
I would also consider removing the download link for 1.0.0 and not depend on users patching it. It's broken.

I have to believe there are users who won't patch it and who won't read the red sign. There's a good chance these are the kinds of users who will also be the most upset by data loss.




On Aug 8, 2010, at 3:06 PM, Jan Lehnardt wrote:

> 
> On 8 Aug 2010, at 18:37, J Chris Anderson wrote:
> 
>> Devs,
>> 
>> I have started a document which we will use when announcing the bug. I plan to move the document from this wiki location to the http://couchdb.apache.org site before the end of the day. Please review and edit the document before then.
>> 
>> http://wiki.couchone.com/page/post-mortem
>> 
>> I have a section called "The Bug" which needs a technical description of the error and the fix. I'm hoping Adam or Randall can write this, as they are most familiar with the issues.
>> 
>> Once it is ready, we should do our best to make sure our users get a chance to read it.
> 
> I made a few more minor adjustments (see page history when you are logged in) and have nothing more to add myself, but I'd appreciate if Adam or Randall could add a few more tech bits.
> 
> --
> 
> In the meantime, I've put up a BIG FAT WARNING on the CouchDB downloads page:  
> 
>  http://couchdb.apache.org/downloads.html
> 
> I plan to update the warning with a link to the post-mortem once that is done.
> 
> --
> 
> Thanks everybody for being on top of this!
> 
> Cheers
> Jan
> -- 
> 
> 
> 


Re: Data loss

Posted by Jan Lehnardt <ja...@apache.org>.
On 8 Aug 2010, at 18:37, J Chris Anderson wrote:

> Devs,
> 
> I have started a document which we will use when announcing the bug. I plan to move the document from this wiki location to the http://couchdb.apache.org site before the end of the day. Please review and edit the document before then.
> 
> http://wiki.couchone.com/page/post-mortem
> 
> I have a section called "The Bug" which needs a technical description of the error and the fix. I'm hoping Adam or Randall can write this, as they are most familiar with the issues.
> 
> Once it is ready, we should do our best to make sure our users get a chance to read it.

I made a few more minor adjustments (see page history when you are logged in) and have nothing more to add myself, but I'd appreciate if Adam or Randall could add a few more tech bits.

--

In the meantime, I've put up a BIG FAT WARNING on the CouchDB downloads page:  

  http://couchdb.apache.org/downloads.html

I plan to update the warning with a link to the post-mortem once that is done.

--

Thanks everybody for being on top of this!

Cheers
Jan
-- 





Re: Data loss

Posted by J Chris Anderson <jc...@gmail.com>.
Devs,

I have started a document which we will use when announcing the bug. I plan to move the document from this wiki location to the http://couchdb.apache.org site before the end of the day. Please review and edit the document before then.

http://wiki.couchone.com/page/post-mortem

I have a section called "The Bug" which needs a technical description of the error and the fix. I'm hoping Adam or Randall can write this, as they are most familiar with the issues.

Once it is ready, we should do our best to make sure our users get a chance to read it.

Thanks,
Chris

On Aug 8, 2010, at 5:16 AM, Robert Newson wrote:

> That was also Adam's conclusion (data loss bug confined to 1.0.0).
> 
> B.


Re: Data loss

Posted by Robert Newson <ro...@gmail.com>.
That was also Adam's conclusion (data loss bug confined to 1.0.0).

B.

On Sun, Aug 8, 2010 at 1:10 PM, Jan Lehnardt <ja...@apache.org> wrote:
>
> On 8 Aug 2010, at 13:48, Noah Slater wrote:
>
>> Do we need to abort 0.11.2 as well?
>
> 0.11.x does not have this commit as far as I can see.
>
> Cheers
> Jan
> --
>

Re: Data loss

Posted by Jan Lehnardt <ja...@apache.org>.
On 8 Aug 2010, at 13:48, Noah Slater wrote:

> Do we need to abort 0.11.2 as well?

0.11.x does not have this commit as far as I can see.

Cheers
Jan
-- 



Re: Data loss

Posted by Noah Slater <ns...@apache.org>.
Do we need to abort 0.11.2 as well?



Re: Data loss

Posted by Jan Lehnardt <ja...@apache.org>.
On 8 Aug 2010, at 06:35, J Chris Anderson wrote:

> 
> On Aug 7, 2010, at 8:45 PM, Dave Cottlehuber wrote:
> 
>> is this serious enough to justify pulling current 1.0.0 release
>> binaries to avoid further installs putting data at risk?
>> 
> 
> I'm not sure what Apache policy is about altering a release after the fact. It's probably up to us to decide what to do. 

Altering releases is a no-no. The only real procedure is to release a new version and deprecate the old one, while optionally keeping it around for posterity.


> Probably as soon as 1.0.1 is available we should pull the 1.0.0 release off of the downloads page, etc.

+1.

> I also think we should do a post-mortem blog post announcing the issue and the remedy, as well as digging into how we can prevent this sort of thing in the future.
> 
> We should make an official announcement before the end of the weekend, with very clear steps to remedy it. (Eg: config delayed_commits to false *without restarting the server* etc)

I think so, too.

Cheers
Jan
--



Re: Data loss

Posted by J Chris Anderson <jc...@apache.org>.
On Aug 7, 2010, at 8:45 PM, Dave Cottlehuber wrote:

> is this serious enough to justify pulling current 1.0.0 release
> binaries to avoid further installs putting data at risk?
> 

I'm not sure what Apache policy is about altering a release after the fact. It's probably up to us to decide what to do. 

Probably as soon as 1.0.1 is available we should pull the 1.0.0 release off of the downloads page, etc.

I also think we should do a post-mortem blog post announcing the issue and the remedy, as well as digging into how we can prevent this sort of thing in the future.

We should make an official announcement before the end of the weekend, with very clear steps to remedy it. (Eg: config delayed_commits to false *without restarting the server* etc)

Chris
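The runtime config change mentioned above can be made over CouchDB's HTTP config API, so no restart is needed. A sketch, assuming a default local install (port 5984, no admin credentials configured):

```shell
# Disable delayed commits on the running server; every write is then
# fsynced before the response returns. The value is a JSON string.
curl -X PUT http://127.0.0.1:5984/_config/couchdb/delayed_commits \
     -d '"false"'
```

The call returns the previous value of the setting; the change takes effect immediately but is also persisted to the local ini file.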

> On 8 August 2010 15:08, Randall Leeds <ra...@gmail.com> wrote:
>> Yes. Adam already back ported it.
>> 
>> Sent from my interstellar unicorn.
>> 
>> On Aug 7, 2010 8:03 PM, "Noah Slater" <ns...@apache.org> wrote:
>> 
>> Time to abort the vote then?
>> 
>> I'd like to get this fix into 1.0.1 if possible.
>> 
>> 
>> On 8 Aug 2010, at 02:28, Damien Katz wrote:
>> 
>>> Thanks.
>>> 
>>> Anyone up to create a repair tool for w...
>> 


Re: Data loss

Posted by Dave Cottlehuber <da...@muse.net.nz>.
is this serious enough to justify pulling current 1.0.0 release
binaries to avoid further installs putting data at risk?

On 8 August 2010 15:08, Randall Leeds <ra...@gmail.com> wrote:
> Yes. Adam already back ported it.

Re: Data loss

Posted by Randall Leeds <ra...@gmail.com>.
Yes. Adam already back ported it.

Sent from my interstellar unicorn.

On Aug 7, 2010 8:03 PM, "Noah Slater" <ns...@apache.org> wrote:

Time to abort the vote then?

I'd like to get this fix into 1.0.1 if possible.



Re: Data loss

Posted by Noah Slater <ns...@apache.org>.
Time to abort the vote then?

I'd like to get this fix into 1.0.1 if possible.

On 8 Aug 2010, at 02:28, Damien Katz wrote:

> Thanks.
> 
> Anyone up to create a repair tool for when this happens? It should be possible to find the previous header, then find the most recent btree roots, find the high seq and apply them to the header and commit. I'm thinking this would be a one time server upgrade script.
> 
> -Damien


Re: Data loss

Posted by Damien Katz <da...@apache.org>.
Thanks.

Anyone up to create a repair tool for when this happens? It should be possible to find the previous header, then find the most recent btree roots and the high seq, and apply them to the header and commit. I'm thinking this would be a one-time server upgrade script.
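To make the idea concrete, here's a sketch of what the "scan backwards for the last valid header" pass could look like. Note the on-disk layout below (4 KB block size, a one-byte marker, a length prefix, an MD5 checksum) is an invented stand-in for illustration, not the real couch_file format, and recovering the newer btree roots and high seq would still need format-specific code on top of this:

```python
import hashlib
import struct

BLOCK_SIZE = 4096  # assumed block size; headers are assumed to sit on block boundaries


def find_latest_header(data: bytes):
    """Scan backwards over block boundaries for the most recent block that
    looks like a valid header: marker byte 0x01, a 4-byte big-endian length,
    a 16-byte MD5 digest, then the payload. Returns (block_index, payload)
    or None if no valid header exists."""
    last_block = (len(data) - 1) // BLOCK_SIZE
    for block in range(last_block, -1, -1):
        pos = block * BLOCK_SIZE
        if data[pos:pos + 1] != b"\x01":
            continue  # not a header block
        (length,) = struct.unpack(">I", data[pos + 1:pos + 5])
        digest = data[pos + 5:pos + 21]
        payload = data[pos + 21:pos + 21 + length]
        if len(payload) == length and hashlib.md5(payload).digest() == digest:
            return block, payload  # most recent header that checksums cleanly
    return None


def write_header(data: bytearray, block: int, payload: bytes) -> None:
    """Write a header record at the given block boundary (same toy layout)."""
    pos = block * BLOCK_SIZE
    record = (b"\x01" + struct.pack(">I", len(payload))
              + hashlib.md5(payload).digest() + payload)
    data[pos:pos + len(record)] = record
```

A repair script along these lines would take the payload of the last valid header, graft the newer btree roots and update seq onto it, and append the result as a fresh committed header.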

-Damien


On Aug 7, 2010, at 5:47 PM, Adam Kocoloski wrote:

> Committed to trunk and 1.0.x.
> 
> On Aug 7, 2010, at 8:33 PM, Randall Leeds wrote:
> 
>> http://github.com/tilgovi/couchdb/tree/fixlostcommits
>> 
>> Test and fix in separate commits at the end of that branch, based off
>> current trunk.
>> Would appreciate verification that the test is initially broken but
>> fixed by the patch.
>> 
>> On Sat, Aug 7, 2010 at 17:16, Damien Katz <da...@apache.org> wrote:
>>> I reproduced this manually:
>>> 
>>> Create document with id "x", ensure full commit (simply wait longer than 1 sec, say 2 secs).
>>> 
>>> Attempt to create document "x" again, get conflict error.
>>> 
>>> Wait at least 2 secs to ensure the delayed commit attempt happens.
>>> 
>>> Now create document "y".
>>> 
>>> Wait at least 2 secs because the delayed commit should happen.
>>> 
>>> Restart server.
>>> 
>>> Document "y" is now missing.
>>> 
>>> The last delayed commit isn't happening. From then on out, no docs updated with delayed commit will be available after a restart.
>>> 
>>> -Damien
>>> 
>>> On Aug 7, 2010, at 4:58 PM, Adam Kocoloski wrote:
>>> 
>>>> I believe it's a single delayed conflict write attempt and no successes in that same interval.
>>>> 
>>>> On Aug 7, 2010, at 7:51 PM, Damien Katz wrote:
>>>> 
>>>>> Looks like all that's necessary is a single delayed conflict write attempt, and all subsequent delayed commits won't be committed; the header never gets written.
>>>>> 
>>>>> 1.0 loses data. This is ridiculously bad.
>>>>> 
>>>>> We need a test to reproduce this and a fix.
>>>>> 
>>>>> -Damien
>>>>> 
>>>>> On Aug 7, 2010, at 4:35 PM, Adam Kocoloski wrote:
>>>>> 
>>>>>> Good sleuthing, guys, and my apologies for letting this through.  Randall, your patch in COUCHDB-794 was actually fine, it was my reworking of it that caused this serious bug.
>>>>>> 
>>>>>> With respect to that gist 513282, I think it would be better to return Db#db{waiting_delayed_commit=nil} when the headers match instead of moving the cancel_timer() command as you did.  After all, we did perform the check here -- it was just that nothing needed to be committed.
>>>>>> 
>>>>>> Adam
>>>>>> 
>>>>>> On Aug 7, 2010, at 6:55 PM, Damien Katz wrote:
>>>>>> 
>>>>>>> Yes, I think it requires 2 conflicting writes in a row, because it needs to trigger the delayed_commit timer without actually having anything to commit, so the header never changes.
>>>>>>> 
>>>>>>> Try to reproduce this and add a test case.
>>>>>>> 
>>>>>>> -Damien
>>>>>>> 
>>>>>>> 
>>>>>>> On Aug 7, 2010, at 3:47 PM, Randall Leeds wrote:
>>>>>>> 
>>>>>>>> I think you may be right, Damien.
>>>>>>>> If ever a write happens that only contains conflicts while waiting for
>>>>>>>> a delayed commit message, we might still be cancelling the timer. Is
>>>>>>>> this what you're thinking? This would be the fix:
>>>>>>>> http://gist.github.com/513282
>>>>>>>> 
>>>>>>>> On Sat, Aug 7, 2010 at 15:42, Damien Katz <da...@apache.org> wrote:
>>>>>>>>> I think the problem might be that 2 conflicting write attempts in a row can leave the #db.waiting_delayed_commit set but the timer has been cancelled. Once that happens, the header may never be written, as it always thinks a delayed commit will fire soon.
>>>>>>>>> 
>>>>>>>>> -Damien
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote:
>>>>>>>>> 
>>>>>>>>>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds <ra...@gmail.com> wrote:
>>>>>>>>>>> I agree completely! I immediately thought of this because I wrote that
>>>>>>>>>>> change. I spent a while staring at it last night but still can't
>>>>>>>>>>> imagine how it's a problem.
>>>>>>>>>>> 
>>>>>>>>>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <da...@apache.org> wrote:
>>>>>>>>>>>> SVN commit r954043 looks suspicious. Digging further.
>>>>>>>>>>>> 
>>>>>>>>>>>> -Damien
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> I still want to stare at r954043, but it looks to me like there's at
>>>>>>>>>> least one situation where we do not commit data correctly during
>>>>>>>>>> compaction. This has to do with the way we now use the path to sync
>>>>>>>>>> outside the couch_file:process. Check this diff:
>>>>>>>>>> http://gist.github.com/513081
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
> 


Re: Data loss

Posted by Adam Kocoloski <ko...@apache.org>.
Committed to trunk and 1.0.x.

On Aug 7, 2010, at 8:33 PM, Randall Leeds wrote:

> http://github.com/tilgovi/couchdb/tree/fixlostcommits
> 
> Test and fix in separate commits at the end of that branch, based off
> current trunk.
> Would appreciate verification that the test is initially broken but
> fixed by the patch.
> 
> On Sat, Aug 7, 2010 at 17:16, Damien Katz <da...@apache.org> wrote:
>> I reproduced this manually:
>> 
>> Create document with id "x", ensure full commit (simply wait longer than 1 sec, say 2 secs).
>> 
>> Attempt to create document "x" again, get conflict error.
>> 
>> Wait at least 2 secs to ensure the delayed commit attempt happens.
>> 
>> Now create document "y".
>> 
>> Wait at least 2 secs because the delayed commit should happen.
>> 
>> Restart server.
>> 
>> Document "y" is now missing.
>> 
>> The last delayed commit isn't happening. From then on out, no docs updated with delayed commit will be available after a restart.
>> 
>> -Damien
>> 
>> On Aug 7, 2010, at 4:58 PM, Adam Kocoloski wrote:
>> 
>>> I believe it's a single delayed conflict write attempt and no successes in that same interval.
>>> 
>>> On Aug 7, 2010, at 7:51 PM, Damien Katz wrote:
>>> 
>>>> Looks like all that's necessary is a single delayed conflict write attempt, and all subsequent delayed commits won't be committed; the header never gets written.
>>>> 
>>>> 1.0 loses data. This is ridiculously bad.
>>>> 
>>>> We need a test to reproduce this and a fix.
>>>> 
>>>> -Damien
>>>> 
>>>> On Aug 7, 2010, at 4:35 PM, Adam Kocoloski wrote:
>>>> 
>>>>> Good sleuthing, guys, and my apologies for letting this through.  Randall, your patch in COUCHDB-794 was actually fine, it was my reworking of it that caused this serious bug.
>>>>> 
>>>>> With respect to that gist 513282, I think it would be better to return Db#db{waiting_delayed_commit=nil} when the headers match instead of moving the cancel_timer() command as you did.  After all, we did perform the check here -- it was just that nothing needed to be committed.
>>>>> 
>>>>> Adam
>>>>> 
>>>>> On Aug 7, 2010, at 6:55 PM, Damien Katz wrote:
>>>>> 
>>>>>> Yes, I think it requires 2 conflicting writes in a row, because it needs to trigger the delayed_commit timer without actually having anything to commit, so the header never changes.
>>>>>> 
>>>>>> Try to reproduce this and add a test case.
>>>>>> 
>>>>>> -Damien
>>>>>> 
>>>>>> 
>>>>>> On Aug 7, 2010, at 3:47 PM, Randall Leeds wrote:
>>>>>> 
>>>>>>> I think you may be right, Damien.
>>>>>>> If ever a write happens that only contains conflicts while waiting for
>>>>>>> a delayed commit message, we might still be cancelling the timer. Is
>>>>>>> this what you're thinking? This would be the fix:
>>>>>>> http://gist.github.com/513282
>>>>>>> 
>>>>>>> On Sat, Aug 7, 2010 at 15:42, Damien Katz <da...@apache.org> wrote:
>>>>>>>> I think the problem might be that 2 conflicting write attempts in a row can leave the #db.waiting_delayed_commit set but the timer has been cancelled. Once that happens, the header may never be written, as it always thinks a delayed commit will fire soon.
>>>>>>>> 
>>>>>>>> -Damien
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote:
>>>>>>>> 
>>>>>>>>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds <ra...@gmail.com> wrote:
>>>>>>>>>> I agree completely! I immediately thought of this because I wrote that
>>>>>>>>>> change. I spent a while staring at it last night but still can't
>>>>>>>>>> imagine how it's a problem.
>>>>>>>>>> 
>>>>>>>>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <da...@apache.org> wrote:
>>>>>>>>>>> SVN commit r954043 looks suspicious. Digging further.
>>>>>>>>>>> 
>>>>>>>>>>> -Damien
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I still want to stare at r954043, but it looks to me like there's at
>>>>>>>>> least one situation where we do not commit data correctly during
>>>>>>>>> compaction. This has to do with the way we now use the path to sync
>>>>>>>>> outside the couch_file:process. Check this diff:
>>>>>>>>> http://gist.github.com/513081
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>> 


Re: Data loss

Posted by Randall Leeds <ra...@gmail.com>.
http://github.com/tilgovi/couchdb/tree/fixlostcommits

Test and fix in separate commits at the end of that branch, based off
current trunk.
Would appreciate verification that the test is initially broken but
fixed by the patch.
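For anyone trying to follow the failure mode the test exercises, it can be modelled in a few lines. This is a toy simulation of the delayed-commit bookkeeping, not CouchDB's actual Erlang code; the class and method names below are invented for illustration:

```python
class DelayedCommitter:
    """Toy model of delayed-commit bookkeeping around #db.waiting_delayed_commit."""

    def __init__(self, fixed: bool):
        self.fixed = fixed                   # apply the bug fix or not
        self.waiting_delayed_commit = False  # mirrors #db.waiting_delayed_commit
        self.timer_armed = False             # the delayed_commit timer
        self.committed = 0                   # docs durably in the header
        self.uncommitted = 0                 # docs written but not yet committed

    def write(self, conflict: bool) -> None:
        if not conflict:
            self.uncommitted += 1
        # Every write schedules a delayed commit, unless one appears pending.
        if not self.waiting_delayed_commit:
            self.waiting_delayed_commit = True
            self.timer_armed = True

    def timer_fires(self) -> None:
        self.timer_armed = False
        if self.uncommitted == 0:
            # Header unchanged (the write was conflict-only): nothing to flush.
            # The buggy code returned here WITHOUT clearing
            # waiting_delayed_commit, so no later write ever re-armed the timer.
            if self.fixed:
                self.waiting_delayed_commit = False
            return
        self.waiting_delayed_commit = False
        self.committed += self.uncommitted
        self.uncommitted = 0


def run(fixed: bool) -> int:
    """Replay Damien's manual reproduction and return how many docs survive."""
    c = DelayedCommitter(fixed)
    c.write(conflict=False)  # create "x"
    c.timer_fires()          # delayed commit flushes "x"
    c.write(conflict=True)   # conflicting re-create of "x"
    c.timer_fires()          # nothing to commit; buggy path strands the flag
    c.write(conflict=False)  # create "y"
    if c.timer_armed:
        c.timer_fires()
    return c.committed       # buggy: 1 ("y" lost on restart); fixed: 2
```

The fixed branch corresponds to returning `Db#db{waiting_delayed_commit=nil}` when the headers match, so the next real write can arm the timer again.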

On Sat, Aug 7, 2010 at 17:16, Damien Katz <da...@apache.org> wrote:
> I reproduced this manually:
>
> Create document with id "x", ensure full commit (simply wait longer than 1 sec, say 2 secs).
>
> Attempt to create document "x" again, get conflict error.
>
> Wait at least 2 secs to ensure the delayed commit attempt happens.
>
> Now create document "y".
>
> Wait at least 2 secs because the delayed commit should happen.
>
> Restart server.
>
> Document "y" is now missing.
>
> The last delayed commit isn't happening. From then on out, no docs updated with delayed commit will be available after a restart.
>
> -Damien
>
> On Aug 7, 2010, at 4:58 PM, Adam Kocoloski wrote:
>
>> I believe it's a single delayed conflict write attempt and no successes in that same interval.
>>
>> On Aug 7, 2010, at 7:51 PM, Damien Katz wrote:
>>
>>> Looks like all that's necessary is a single delayed conflict write attempt, and all subsequent delayed commits won't be committed; the header never gets written.
>>>
>>> 1.0 loses data. This is ridiculously bad.
>>>
>>> We need a test to reproduce this and a fix.
>>>
>>> -Damien
>>>
>>> On Aug 7, 2010, at 4:35 PM, Adam Kocoloski wrote:
>>>
>>>> Good sleuthing, guys, and my apologies for letting this through.  Randall, your patch in COUCHDB-794 was actually fine, it was my reworking of it that caused this serious bug.
>>>>
>>>> With respect to that gist 513282, I think it would be better to return Db#db{waiting_delayed_commit=nil} when the headers match instead of moving the cancel_timer() command as you did.  After all, we did perform the check here -- it was just that nothing needed to be committed.
>>>>
>>>> Adam
>>>>
>>>> On Aug 7, 2010, at 6:55 PM, Damien Katz wrote:
>>>>
>>>>> Yes, I think it requires 2 conflicting writes in a row, because it needs to trigger the delayed_commit timer without actually having anything to commit, so the header never changes.
>>>>>
>>>>> Try to reproduce this and add a test case.
>>>>>
>>>>> -Damien
>>>>>
>>>>>
>>>>> On Aug 7, 2010, at 3:47 PM, Randall Leeds wrote:
>>>>>
>>>>>> I think you may be right, Damien.
>>>>>> If ever a write happens that only contains conflicts while waiting for
>>>>>> a delayed commit message, we might still be cancelling the timer. Is
>>>>>> this what you're thinking? This would be the fix:
>>>>>> http://gist.github.com/513282
>>>>>>
>>>>>> On Sat, Aug 7, 2010 at 15:42, Damien Katz <da...@apache.org> wrote:
>>>>>>> I think the problem might be that 2 conflicting write attempts in a row can leave the #db.waiting_delayed_commit set but the timer has been cancelled. Once that happens, the header may never be written, as it always thinks a delayed commit will fire soon.
>>>>>>>
>>>>>>> -Damien
>>>>>>>
>>>>>>>
>>>>>>> On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote:
>>>>>>>
>>>>>>>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds <ra...@gmail.com> wrote:
>>>>>>>>> I agree completely! I immediately thought of this because I wrote that
>>>>>>>>> change. I spent a while staring at it last night but still can't
>>>>>>>>> imagine how it's a problem.
>>>>>>>>>
>>>>>>>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <da...@apache.org> wrote:
>>>>>>>>>> SVN commit r954043 looks suspicious. Digging further.
>>>>>>>>>>
>>>>>>>>>> -Damien
>>>>>>>>>
>>>>>>>>
>>>>>>>> I still want to stare at r954043, but it looks to me like there's at
>>>>>>>> least one situation where we do not commit data correctly during
>>>>>>>> compaction. This has to do with the way we now use the path to sync
>>>>>>>> outside the couch_file:process. Check this diff:
>>>>>>>> http://gist.github.com/513081
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>
>
>
