Posted to user@couchdb.apache.org by Yue Chuan Lim <sh...@gmail.com> on 2010/08/07 05:58:55 UTC

Data loss

I have a set of documents that have been committed for more than a day,
regularly read from without a problem. Had to stop the database service to
do some debugging, used the couchdb.bat provided in CouchDB/bin for easy
access to the log. And I noticed that I basically lost all the documents in
question.

There does not appear to be corruption per se, but it is as if my database
just rolled back to the state it was in a few days ago, i.e. most of my
documents are there but some old documents that I'm pretty sure I have
deleted are back, and my newer documents are gone.

Appears to have happened to me more than once, shrugged it off the last time
as it might be just a mix-up, but I am definite that my database has
certainly rolled back this time.

Is there any situation in which this might happen?

Thanks
Yue Chuan

Re: Data loss

Posted by Damien Katz <da...@apache.org>.
We don't know if that will fix anything. ensure_full_commit is a no-op if the db server thinks there is nothing to commit.

-Damien

On Aug 7, 2010, at 12:03 PM, J Chris Anderson wrote:

> A solution would be to POST to /db/_ensure_full_commit with a content type of application/json and an empty body, before doing any restart.
> 
> 
> Chris
> 
> 
> On Aug 7, 2010, at 11:56 AM, Randall Leeds wrote:
> 
>> I agree completely! I immediately thought of this because I wrote that
>> change. I spent a while staring at it last night but still can't
>> imagine how it's a problem.
>> 
>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <da...@apache.org> wrote:
>>> SVN commit r954043 looks suspicious. Digging further.
>>> 
>>> -Damien
> 


Re: Data loss

Posted by Randall Leeds <ra...@gmail.com>.
Well. That'll help, but it's not a *solution* because it seems like
we're leaving data uncommitted for a long time in some cases.

On Sat, Aug 7, 2010 at 12:03, J Chris Anderson <jc...@apache.org> wrote:
> A solution would be to POST to /db/_ensure_full_commit with a content type of application/json and an empty body, before doing any restart.
>
>
> Chris
>
>
> On Aug 7, 2010, at 11:56 AM, Randall Leeds wrote:
>
>> I agree completely! I immediately thought of this because I wrote that
>> change. I spent a while staring at it last night but still can't
>> imagine how it's a problem.
>>
>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <da...@apache.org> wrote:
>>> SVN commit r954043 looks suspicious. Digging further.
>>>
>>> -Damien
>
>

Re: Data loss

Posted by J Chris Anderson <jc...@apache.org>.
A solution would be to POST to /db/_ensure_full_commit with a content type of application/json and an empty body, before doing any restart.


Chris
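
For reference, the flush Chris describes can be issued programmatically. This is a sketch only: the host, port, and database name are assumptions, and actually sending the request requires a running CouchDB.

```python
# Build the POST /db/_ensure_full_commit request described above.
# "http://127.0.0.1:5984" and "mydb" are hypothetical; send the built
# request with urllib.request.urlopen(req) against a live server.
import urllib.request

def ensure_full_commit_request(base_url, db):
    return urllib.request.Request(
        url="%s/%s/_ensure_full_commit" % (base_url, db),
        data=b"",  # empty body, per the remedy above
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = ensure_full_commit_request("http://127.0.0.1:5984", "mydb")
```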


On Aug 7, 2010, at 11:56 AM, Randall Leeds wrote:

> I agree completely! I immediately thought of this because I wrote that
> change. I spent a while staring at it last night but still can't
> imagine how it's a problem.
> 
> On Sat, Aug 7, 2010 at 11:12, Damien Katz <da...@apache.org> wrote:
>> SVN commit r954043 looks suspicious. Digging further.
>> 
>> -Damien


Re: Data loss

Posted by Robert Newson <ro...@gmail.com>.
Yes there is, and it bears repeating:

"POSTing to /db/_ensure_full_commit will still cause a header to be written."

B.

On Sun, Aug 8, 2010 at 8:40 AM, Sascha Reuter <s....@geek-it.de> wrote:
> Is there any way to manually trigger a commit before stopping, upgrading and restarting the server, so data loss can be prevented?! If so, this should be marked big in the announcement!
>
> Cheers,
>
> Sascha
>
>>
>> 1.0 loses data. This is ridiculously bad.
>>
>

Re: Data loss

Posted by Sascha Reuter <s....@geek-it.de>.
Is there any way to manually trigger a commit before stopping, upgrading and restarting the server, so data loss can be prevented?! If so, this should be marked big in the announcement!

Cheers,

Sascha

> 
> 1.0 loses data. This is ridiculously bad.
> 

Re: Data loss

Posted by Randall Leeds <ra...@gmail.com>.
On Sat, Aug 7, 2010 at 18:01, Adam Kocoloski <ko...@apache.org> wrote:
> POSTing to /db/_ensure_full_commit will still cause a header to be written.
>
> Switching to delayed_commits = false and then writing a document will cause a header to be written for that DB.
>
> POSTing to /_ensure_full_commit for each DB and then flipping the delayed_commits to false will put a 1.0.0 server into a safe state with all data saved.

The safest, I think, would be to flip to delayed_commits=false first
and then post to /_ensure_full_commit on each DB.
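
The safe ordering described above can be sketched as a request plan. This is a hypothetical helper: the config path follows the CouchDB 1.x HTTP API, and the database names would be whatever GET /_all_dbs returns; actually issuing the requests against a live server is left out.

```python
# Build the ordered list of HTTP requests for putting a 1.0.0 server
# into a safe state: first disable delayed commits via the config API,
# then flush every database with _ensure_full_commit. This only
# computes the plan; it does not talk to a server.

def safe_state_plan(dbs):
    plan = [("PUT", "/_config/couchdb/delayed_commits", '"false"')]
    for db in dbs:
        plan.append(("POST", "/%s/_ensure_full_commit" % db, ""))
    return plan
```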

Re: Data loss

Posted by Adam Kocoloski <ko...@apache.org>.
POSTing to /db/_ensure_full_commit will still cause a header to be written.

Switching to delayed_commits = false and then writing a document will cause a header to be written for that DB.

POSTing to /_ensure_full_commit for each DB and then flipping the delayed_commits to false will put a 1.0.0 server into a safe state with all data saved.

Adam

On Aug 7, 2010, at 8:57 PM, Chris Anderson wrote:

> Will switching a running 1.0 server to delayed_commits=true cause the uncommitted headers to be written? Are there other remedies for folks with critical data in 1.0 who want to ensure they are safe?
> 
> Chris
> 
> Typed on glass.
> 
> On Aug 7, 2010, at 5:47 PM, Adam Kocoloski <ko...@apache.org> wrote:
> 
>> Committed to trunk and 1.0.x.
>> 
>> On Aug 7, 2010, at 8:33 PM, Randall Leeds wrote:
>> 
>>> http://github.com/tilgovi/couchdb/tree/fixlostcommits
>>> 
>>> Test and fix in separate commits at the end of that branch, based off
>>> current trunk.
>>> Would appreciate verification that the test is initially broken but
>>> fixed by the patch.
>>> 
>>> On Sat, Aug 7, 2010 at 17:16, Damien Katz <da...@apache.org> wrote:
>>>> I reproduced this manually:
>>>> 
>>>> Create document with id "x", ensure full commit (simply wait longer than 1 sec, say 2 secs).
>>>> 
>>>> Attempt to create document "x" again, get conflict error.
>>>> 
>>>> Wait at least 2 secs to ensure the delayed commit attempt happens.
>>>> 
>>>> Now create document "y".
>>>> 
>>>> Wait at least 2 secs because the delayed commit should happen.
>>>> 
>>>> Restart server.
>>>> 
>>>> Document "y" is now missing.
>>>> 
>>>> The last delayed commit isn't happening. From then on out, no docs updated with delayed commit will be available after a restart.
>>>> 
>>>> -Damien
>>>> 
>>>> On Aug 7, 2010, at 4:58 PM, Adam Kocoloski wrote:
>>>> 
>>>>> I believe it's a single delayed conflict write attempt and no successes in that same interval.
>>>>> 
>>>>> On Aug 7, 2010, at 7:51 PM, Damien Katz wrote:
>>>>> 
>>>>>> Looks like all that's necessary is a single delayed conflict write attempt, and all subsequent delayed commits won't be committed; the header never gets written.
>>>>>> 
>>>>>> 1.0 loses data. This is ridiculously bad.
>>>>>> 
>>>>>> We need a test to reproduce this and fix.
>>>>>> 
>>>>>> -Damien
>>>>>> 
>>>>>> On Aug 7, 2010, at 4:35 PM, Adam Kocoloski wrote:
>>>>>> 
>>>>>>> Good sleuthing guys, and my apologies for letting this through.  Randall, your patch in COUCHDB-794 was actually fine, it was my reworking of it that caused this serious bug.
>>>>>>> 
>>>>>>> With respect to that gist 513282, I think it would be better to return Db#db{waiting_delayed_commit=nil} when the headers match instead of moving the cancel_timer() command as you did.  After all, we did perform the check here -- it was just that nothing needed to be committed.
>>>>>>> 
>>>>>>> Adam
>>>>>>> 
>>>>>>> On Aug 7, 2010, at 6:55 PM, Damien Katz wrote:
>>>>>>> 
>>>>>>>> Yes, I think it requires 2 conflicting writes in a row, because it needs to trigger the delayed_commit timer without actually having anything to commit, so the header never changes.
>>>>>>>> 
>>>>>>>> Try to reproduce this and add a test case.
>>>>>>>> 
>>>>>>>> -Damien
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Aug 7, 2010, at 3:47 PM, Randall Leeds wrote:
>>>>>>>> 
>>>>>>>>> I think you may be right, Damien.
>>>>>>>>> If ever a write happens that only contains conflicts while waiting for
>>>>>>>>> a delayed commit message we might still be cancelling the timer. Is
>>>>>>>>> this what you're thinking? This would be the fix:
>>>>>>>>> http://gist.github.com/513282
>>>>>>>>> 
>>>>>>>>> On Sat, Aug 7, 2010 at 15:42, Damien Katz <da...@apache.org> wrote:
>>>>>>>>>> I think the problem might be that 2 conflicting write attempts in a row can leave the #db.waiting_delayed_commit set but the timer has been cancelled. Once that happens, the header may never be written, as it always thinks a delayed commit will fire soon.
>>>>>>>>>> 
>>>>>>>>>> -Damien
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote:
>>>>>>>>>> 
>>>>>>>>>>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds <ra...@gmail.com> wrote:
>>>>>>>>>>>> I agree completely! I immediately thought of this because I wrote that
>>>>>>>>>>>> change. I spent a while staring at it last night but still can't
>>>>>>>>>>>> imagine how it's a problem.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <da...@apache.org> wrote:
>>>>>>>>>>>>> SVN commit r954043 looks suspicious. Digging further.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -Damien
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> I still want to stare at r954043, but it looks to me like there's at
>>>>>>>>>>> least one situation where we do not commit data correctly during
>>>>>>>>>>> compaction. This has to do with the way we now use the path to sync
>>>>>>>>>>> outside the couch_file:process. Check this diff:
>>>>>>>>>>> http://gist.github.com/513081
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>> 
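
The failure mode pinned down in this thread can be illustrated with a toy model. This is an assumption-laden Python sketch of the state machine, not the actual Erlang code: a conflict-only write cancels the delayed-commit timer without clearing the pending flag, so later writes never re-arm it and the header is never written.

```python
# Toy model of the diagnosed bug: a delayed-commit timer plus a
# "commit pending" flag. A conflict-only write (nothing to commit)
# cancels the timer but leaves waiting_delayed_commit set, so every
# subsequent write assumes a commit is already scheduled.

class ToyDb:
    def __init__(self):
        self.timer_armed = False
        self.waiting_delayed_commit = False
        self.committed = False

    def write(self, has_changes):
        if not self.waiting_delayed_commit:
            self.waiting_delayed_commit = True
            self.timer_armed = True      # arm the delayed-commit timer
        if not has_changes:              # conflict-only write:
            self.timer_armed = False     # buggy path cancels the timer
            # bug: waiting_delayed_commit is NOT cleared here

    def timer_fires(self):
        if self.timer_armed:
            self.committed = True        # header written
            self.timer_armed = False
            self.waiting_delayed_commit = False

db = ToyDb()
db.write(has_changes=False)  # conflicted write, nothing to commit
db.write(has_changes=True)   # real write; flag already set, no timer armed
db.timer_fires()             # no armed timer, so nothing is committed
```

In this model the real write is lost on restart, matching Damien's manual reproduction.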


Re: Data loss

Posted by Chris Anderson <jc...@gmail.com>.
Will switching a running 1.0 server to delayed_commits=true cause the uncommitted headers to be written? Are there other remedies for folks with critical data in 1.0 who want to ensure they are safe?

Chris

Typed on glass.

On Aug 7, 2010, at 5:47 PM, Adam Kocoloski <ko...@apache.org> wrote:

> Committed to trunk and 1.0.x.
> 
> On Aug 7, 2010, at 8:33 PM, Randall Leeds wrote:
> 
>> http://github.com/tilgovi/couchdb/tree/fixlostcommits
>> 
>> Test and fix in separate commits at the end of that branch, based off
>> current trunk.
>> Would appreciate verification that the test is initially broken but
>> fixed by the patch.
>> 
>> On Sat, Aug 7, 2010 at 17:16, Damien Katz <da...@apache.org> wrote:
>>> I reproduced this manually:
>>> 
>>> Create document with id "x", ensure full commit (simply wait longer than 1 sec, say 2 secs).
>>> 
>>> Attempt to create document "x" again, get conflict error.
>>> 
>>> Wait at least 2 secs to ensure the delayed commit attempt happens.
>>> 
>>> Now create document "y".
>>> 
>>> Wait at least 2 secs because the delayed commit should happen.
>>> 
>>> Restart server.
>>> 
>>> Document "y" is now missing.
>>> 
>>> The last delayed commit isn't happening. From then on out, no docs updated with delayed commit will be available after a restart.
>>> 
>>> -Damien
>>> 
>>> On Aug 7, 2010, at 4:58 PM, Adam Kocoloski wrote:
>>> 
>>>> I believe it's a single delayed conflict write attempt and no successes in that same interval.
>>>> 
>>>> On Aug 7, 2010, at 7:51 PM, Damien Katz wrote:
>>>> 
>>>>> Looks like all that's necessary is a single delayed conflict write attempt, and all subsequent delayed commits won't be committed; the header never gets written.
>>>>> 
>>>>> 1.0 loses data. This is ridiculously bad.
>>>>> 
>>>>> We need a test to reproduce this and fix.
>>>>> 
>>>>> -Damien
>>>>> 
>>>>> On Aug 7, 2010, at 4:35 PM, Adam Kocoloski wrote:
>>>>> 
>>>>>> Good sleuthing guys, and my apologies for letting this through.  Randall, your patch in COUCHDB-794 was actually fine, it was my reworking of it that caused this serious bug.
>>>>>> 
>>>>>> With respect to that gist 513282, I think it would be better to return Db#db{waiting_delayed_commit=nil} when the headers match instead of moving the cancel_timer() command as you did.  After all, we did perform the check here -- it was just that nothing needed to be committed.
>>>>>> 
>>>>>> Adam
>>>>>> 
>>>>>> On Aug 7, 2010, at 6:55 PM, Damien Katz wrote:
>>>>>> 
>>>>>>> Yes, I think it requires 2 conflicting writes in a row, because it needs to trigger the delayed_commit timer without actually having anything to commit, so the header never changes.
>>>>>>> 
>>>>>>> Try to reproduce this and add a test case.
>>>>>>> 
>>>>>>> -Damien
>>>>>>> 
>>>>>>> 
>>>>>>> On Aug 7, 2010, at 3:47 PM, Randall Leeds wrote:
>>>>>>> 
>>>>>>>> I think you may be right, Damien.
>>>>>>>> If ever a write happens that only contains conflicts while waiting for
>>>>>>>> a delayed commit message we might still be cancelling the timer. Is
>>>>>>>> this what you're thinking? This would be the fix:
>>>>>>>> http://gist.github.com/513282
>>>>>>>> 
>>>>>>>> On Sat, Aug 7, 2010 at 15:42, Damien Katz <da...@apache.org> wrote:
>>>>>>>>> I think the problem might be that 2 conflicting write attempts in a row can leave the #db.waiting_delayed_commit set but the timer has been cancelled. Once that happens, the header may never be written, as it always thinks a delayed commit will fire soon.
>>>>>>>>> 
>>>>>>>>> -Damien
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote:
>>>>>>>>> 
>>>>>>>>>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds <ra...@gmail.com> wrote:
>>>>>>>>>>> I agree completely! I immediately thought of this because I wrote that
>>>>>>>>>>> change. I spent a while staring at it last night but still can't
>>>>>>>>>>> imagine how it's a problem.
>>>>>>>>>>> 
>>>>>>>>>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <da...@apache.org> wrote:
>>>>>>>>>>>> SVN commit r954043 looks suspicious. Digging further.
>>>>>>>>>>>> 
>>>>>>>>>>>> -Damien
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> I still want to stare at r954043, but it looks to me like there's at
>>>>>>>>>> least one situation where we do not commit data correctly during
>>>>>>>>>> compaction. This has to do with the way we now use the path to sync
>>>>>>>>>> outside the couch_file:process. Check this diff:
>>>>>>>>>> http://gist.github.com/513081
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
> 

Re: Data loss

Posted by Jan Lehnardt <ja...@apache.org>.
On 8 Aug 2010, at 13:48, Noah Slater wrote:

> Do we need to abort 0.11.2 as well?
> 
> On 8 Aug 2010, at 11:45, Jan Lehnardt wrote:
> 
>> 
>> On 8 Aug 2010, at 06:35, J Chris Anderson wrote:
>> 
>>> 
>>> On Aug 7, 2010, at 8:45 PM, Dave Cottlehuber wrote:
>>> 
>>>> is this serious enough to justify pulling current 1.0.0 release
>>>> binaries to avoid further installs putting data at risk?
>>>> 
>>> 
>>> I'm not sure what Apache policy is about altering a release after the fact. It's probably up to us to decide what to do.
>> 
>> Altering releases is a no-no. The only real procedure is to release a new version and deprecate the old one, while optionally keeping it around for posterity.
>> 
>> 
>>> Probably as soon as 1.0.1 is available we should pull the 1.0.0 release off of the downloads page, etc.
>> 
>> +1.
>> 
>>> I also think we should do a post-mortem blog post announcing the issue and the remedy, as well as digging into how we can prevent this sort of thing in the future.
>>> 
>>> We should make an official announcement before the end of the weekend, with very clear steps to remedy it. (Eg: config delayed_commits to false *without restarting the server* etc)
>> 
>> I think so, too.
>> 
>> Cheers
>> Jan
>> --
>> 
>>> 
>>> 
>>>> On 8 August 2010 15:08, Randall Leeds <ra...@gmail.com> wrote:
>>>>> Yes. Adam already back ported it.
>>>>> 
>>>>> Sent from my interstellar unicorn.
>>>>> 
>>>>> On Aug 7, 2010 8:03 PM, "Noah Slater" <ns...@apache.org> wrote:
>>>>> 
>>>>> Time to abort the vote then?
>>>>> 
>>>>> I'd like to get this fix into 1.0.1 if possible.
>>>>> 
>>>>> 
>>>>> On 8 Aug 2010, at 02:28, Damien Katz wrote:
>>>>> 
>>>>>> Thanks.
>>>>>> 
>>>>>> Anyone up to create a repair tool for w...
>>>>> 
>>> 
>> 
> 


[NOTICE] Data loss bug (and fix)

Posted by J Chris Anderson <jc...@apache.org>.
Over the weekend of August 7th–8th, 2010, we discovered and fixed a nasty bug in CouchDB 1.0.0. There is potential data loss for users of 1.0.0 running with the default configuration of delayed_commits=true.

We've issued an in-place fix and details about the data loss bug here:

http://couchdb.apache.org/notice/1.0.1.html

The 1.0.1 release will include a permanent fix, but in the meantime, following these instructions will ensure your data is safe.

Chris

[NOTICE] Data loss bug (and fix)

Posted by J Chris Anderson <jc...@apache.org>.
Over the weekend of August 7th–8th, 2010, we discovered and fixed a nasty bug in CouchDB 1.0.0. There is potential data loss for users of 1.0.0 running with the default configuration of delayed_commits=true.

We've issued an in-place fix and details about the data loss bug here:

http://couchdb.apache.org/notice/1.0.1.html

The 1.0.1 release will include a permanent fix, but in the meantime, following these instructions will ensure your data is safe.

Chris

Re: Data loss

Posted by Jan Lehnardt <ja...@apache.org>.
On 8 Aug 2010, at 21:49, Noah Slater wrote:

> Done.
> 
> The public site should update within the hour.
> 
> The official distribution directory no longer has 1.0.0, but the mirrors will for another 24 hours.

Randall was so kind to update the technical details in Chris's wiki page. I took the liberty (with help from Noah) to add it on the site under notice/1.0.1.html (as a release notice for the upcoming 1.0.1 release). I also updated the downloads page to point to the notice. It'll be up within the hour (or two).

Thanks again all for getting this resolved so quickly. The team spirit here really makes this a fun project :)

Cheers
Jan
-- 

> 
> On 8 Aug 2010, at 20:43, Jan Lehnardt wrote:
> 
>> 
>> On 8 Aug 2010, at 21:24, Noah Slater wrote:
>> 
>>> What you are suggesting is an archival of the release, which means removing it from the downloads page, the distribution directory, and the mirrors. I can do this, but I'd like to know that we have consensus first. The plan as I understood it was to archive this release at the same time as making the 1.0.1 release.
>> 
>> I'd like to follow that plan.
>> 
>> Cheers
>> Jan
>> -- 
>> 
>>> 
>>> On 8 Aug 2010, at 20:21, Robert Dionne wrote:
>>> 
>>>> I would also consider removing the download link for 1.0.0 and not depend on users patching it. It's broken.
>>>> 
>>>> I have to believe there are users who won't patch it, and who won't read the red sign either. There's a good probability these are the kinds of users who will also be the most upset by data loss.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Aug 8, 2010, at 3:06 PM, Jan Lehnardt wrote:
>>>> 
>>>>> 
>>>>> On 8 Aug 2010, at 18:37, J Chris Anderson wrote:
>>>>> 
>>>>>> Devs,
>>>>>> 
>>>>>> I have started a document which we will use when announcing the bug. I plan to move the document from this wiki location to the http://couchdb.apache.org site before the end of the day. Please review and edit the document before then.
>>>>>> 
>>>>>> http://wiki.couchone.com/page/post-mortem
>>>>>> 
>>>>>> I have a section called "The Bug" which needs a technical description of the error and the fix. I'm hoping Adam or Randall can write this, as they are most familiar with the issues.
>>>>>> 
>>>>>> Once it is ready, we should do our best to make sure our users get a chance to read it.
>>>>> 
>>>>> I made a few more minor adjustments (see page history when you are logged in) and have nothing more to add myself, but I'd appreciate if Adam or Randall could add a few more tech bits.
>>>>> 
>>>>> --
>>>>> 
>>>>> In the meantime, I've put up a BIG FAT WARNING on the CouchDB downloads page:  
>>>>> 
>>>>> http://couchdb.apache.org/downloads.html
>>>>> 
>>>>> I plan to update the warning with a link to the post-mortem once that is done.
>>>>> 
>>>>> --
>>>>> 
>>>>> Thanks everybody for being on top of this!
>>>>> 
>>>>> Cheers
>>>>> Jan
>>>>> -- 
>>>>> 
>>>>> 
>>>>> 
>>>>>> 
>>>>>> Thanks,
>>>>>> Chris
>>>>>> 
>>>>>> On Aug 8, 2010, at 5:16 AM, Robert Newson wrote:
>>>>>> 
>>>>>>> That was also Adam's conclusion (data loss bug confined to 1.0.0).
>>>>>>> 
>>>>>>> B.
>>>>>>> 
>>>>>>> On Sun, Aug 8, 2010 at 1:10 PM, Jan Lehnardt <ja...@apache.org> wrote:
>>>>>>>> 
>>>>>>>> On 8 Aug 2010, at 13:48, Noah Slater wrote:
>>>>>>>> 
>>>>>>>>> Do we need to abort 0.11.2 as well?
>>>>>>>> 
>>>>>>>> 0.11.x does not have this commit as far as I can see.
>>>>>>>> 
>>>>>>>> Cheers
>>>>>>>> Jan
>>>>>>>> --
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 8 Aug 2010, at 11:45, Jan Lehnardt wrote:
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 8 Aug 2010, at 06:35, J Chris Anderson wrote:
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Aug 7, 2010, at 8:45 PM, Dave Cottlehuber wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> is this serious enough to justify pulling current 1.0.0 release
>>>>>>>>>>>> binaries to avoid further installs putting data at risk?
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> I'm not sure what Apache policy is about altering a release after the fact. It's probably up to us to decide what to do.
>>>>>>>>>> 
>>>>>>>>>> Altering releases is a no-no. The only real procedure is to release a new version and deprecate the old one, while optionally keeping it around for posterity.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> Probably as soon as 1.0.1 is available we should pull the 1.0.0 release off of the downloads page, etc.
>>>>>>>>>> 
>>>>>>>>>> +1.
>>>>>>>>>> 
>>>>>>>>>>> I also think we should do a post-mortem blog post announcing the issue and the remedy, as well as digging into how we can prevent this sort of thing in the future.
>>>>>>>>>>> 
>>>>>>>>>>> We should make an official announcement before the end of the weekend, with very clear steps to remedy it. (Eg: config delayed_commits to false *without restarting the server* etc)
>>>>>>>>>> 
>>>>>>>>>> I think so, too.
>>>>>>>>>> 
>>>>>>>>>> Cheers
>>>>>>>>>> Jan
>>>>>>>>>> --
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On 8 August 2010 15:08, Randall Leeds <ra...@gmail.com> wrote:
>>>>>>>>>>>>> Yes. Adam already back ported it.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Sent from my interstellar unicorn.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Aug 7, 2010 8:03 PM, "Noah Slater" <ns...@apache.org> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Time to abort the vote then?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I'd like to get this fix into 1.0.1 if possible.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On 8 Aug 2010, at 02:28, Damien Katz wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Anyone up to create a repair tool for w...
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 


Re: Data loss

Posted by Noah Slater <ns...@apache.org>.
Done.

The public site should update within the hour.

The official distribution directory no longer has 1.0.0, but the mirrors will for another 24 hours.

On 8 Aug 2010, at 20:43, Jan Lehnardt wrote:

> 
> On 8 Aug 2010, at 21:24, Noah Slater wrote:
> 
>> What you are suggesting is an archival of the release, which means removing it from the downloads page, the distribution directory, and the mirrors. I can do this, but I'd like to know that we have consensus first. The plan as I understood it was to archive this release at the same time as making the 1.0.1 release.
> 
> I'd like to follow that plan.
> 
> Cheers
> Jan
> -- 
> 
>> 
>> On 8 Aug 2010, at 20:21, Robert Dionne wrote:
>> 
>>> I would also consider removing the download link for 1.0.0 and not depend on users patching it. It's broken.
>>> 
>>> I have to believe there are users who won't patch it, and who won't read the red sign either. There's a good probability these are the kinds of users who will also be the most upset by data loss.
>>> 
>>> 
>>> 
>>> 
>>> On Aug 8, 2010, at 3:06 PM, Jan Lehnardt wrote:
>>> 
>>>> 
>>>> On 8 Aug 2010, at 18:37, J Chris Anderson wrote:
>>>> 
>>>>> Devs,
>>>>> 
>>>>> I have started a document which we will use when announcing the bug. I plan to move the document from this wiki location to the http://couchdb.apache.org site before the end of the day. Please review and edit the document before then.
>>>>> 
>>>>> http://wiki.couchone.com/page/post-mortem
>>>>> 
>>>>> I have a section called "The Bug" which needs a technical description of the error and the fix. I'm hoping Adam or Randall can write this, as they are most familiar with the issues.
>>>>> 
>>>>> Once it is ready, we should do our best to make sure our users get a chance to read it.
>>>> 
>>>> I made a few more minor adjustments (see page history when you are logged in) and have nothing more to add myself, but I'd appreciate if Adam or Randall could add a few more tech bits.
>>>> 
>>>> --
>>>> 
>>>> In the meantime, I've put up a BIG FAT WARNING on the CouchDB downloads page:  
>>>> 
>>>> http://couchdb.apache.org/downloads.html
>>>> 
>>>> I plan to update the warning with a link to the post-mortem once that is done.
>>>> 
>>>> --
>>>> 
>>>> Thanks everybody for being on top of this!
>>>> 
>>>> Cheers
>>>> Jan
>>>> -- 
>>>> 
>>>> 
>>>> 
>>>>> 
>>>>> Thanks,
>>>>> Chris
>>>>> 
>>>>> On Aug 8, 2010, at 5:16 AM, Robert Newson wrote:
>>>>> 
>>>>>> That was also Adam's conclusion (data loss bug confined to 1.0.0).
>>>>>> 
>>>>>> B.
>>>>>> 
>>>>>> On Sun, Aug 8, 2010 at 1:10 PM, Jan Lehnardt <ja...@apache.org> wrote:
>>>>>>> 
>>>>>>> On 8 Aug 2010, at 13:48, Noah Slater wrote:
>>>>>>> 
>>>>>>>> Do we need to abort 0.11.2 as well?
>>>>>>> 
>>>>>>> 0.11.x does not have this commit as far as I can see.
>>>>>>> 
>>>>>>> Cheers
>>>>>>> Jan
>>>>>>> --
>>>>>>> 
>>>>>>>> 
>>>>>>>> On 8 Aug 2010, at 11:45, Jan Lehnardt wrote:
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 8 Aug 2010, at 06:35, J Chris Anderson wrote:
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Aug 7, 2010, at 8:45 PM, Dave Cottlehuber wrote:
>>>>>>>>>> 
>>>>>>>>>>> is this serious enough to justify pulling current 1.0.0 release
>>>>>>>>>>> binaries to avoid further installs putting data at risk?
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> I'm not sure what Apache policy is about altering a release after the fact. It's probably up to us to decide what to do.
>>>>>>>>> 
>>>>>>>>> Altering releases is a no-no. The only real procedure is to release a new version and deprecate the old one, while optionally keeping it around for posterity.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> Probably as soon as 1.0.1 is available we should pull the 1.0.0 release off of the downloads page, etc.
>>>>>>>>> 
>>>>>>>>> +1.
>>>>>>>>> 
>>>>>>>>>> I also think we should do a post-mortem blog post announcing the issue and the remedy, as well as digging into how we can prevent this sort of thing in the future.
>>>>>>>>>> 
>>>>>>>>>> We should make an official announcement before the end of the weekend, with very clear steps to remedy it. (Eg: config delayed_commits to false *without restarting the server* etc)
>>>>>>>>> 
>>>>>>>>> I think so, too.
>>>>>>>>> 
>>>>>>>>> Cheers
>>>>>>>>> Jan
>>>>>>>>> --
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On 8 August 2010 15:08, Randall Leeds <ra...@gmail.com> wrote:
>>>>>>>>>>>> Yes. Adam already back ported it.
>>>>>>>>>>>> 
>>>>>>>>>>>> Sent from my interstellar unicorn.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Aug 7, 2010 8:03 PM, "Noah Slater" <ns...@apache.org> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Time to abort the vote then?
>>>>>>>>>>>> 
>>>>>>>>>>>> I'd like to get this fix into 1.0.1 if possible.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On 8 Aug 2010, at 02:28, Damien Katz wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Anyone up to create a repair tool for w...
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 


Re: Data loss

Posted by Jan Lehnardt <ja...@apache.org>.
On 8 Aug 2010, at 21:24, Noah Slater wrote:

> What you are suggesting is an archival of the release, which means removing it from the downloads page, the distribution directory, and the mirrors. I can do this, but I'd like to know that we have consensus first. The plan as I understood it was to archive this release at the same time as making the 1.0.1 release.

I'd like to follow that plan.

Cheers
Jan
-- 

> 
> On 8 Aug 2010, at 20:21, Robert Dionne wrote:
> 
>> I would also consider removing the download link for 1.0.0 and not depend on users patching it. It's broken.
>> 
>> I have to believe there are users who won't patch it, and who won't read the red sign either. There's a good probability these are the kinds of users who will also be the most upset by data loss.
>> 
>> 
>> 
>> 
>> On Aug 8, 2010, at 3:06 PM, Jan Lehnardt wrote:
>> 
>>> 
>>> On 8 Aug 2010, at 18:37, J Chris Anderson wrote:
>>> 
>>>> Devs,
>>>> 
>>>> I have started a document which we will use when announcing the bug. I plan to move the document from this wiki location to the http://couchdb.apache.org site before the end of the day. Please review and edit the document before then.
>>>> 
>>>> http://wiki.couchone.com/page/post-mortem
>>>> 
>>>> I have a section called "The Bug" which needs a technical description of the error and the fix. I'm hoping Adam or Randall can write this, as they are most familiar with the issues.
>>>> 
>>>> Once it is ready, we should do our best to make sure our users get a chance to read it.
>>> 
>>> I made a few more minor adjustments (see page history when you are logged in) and have nothing more to add myself, but I'd appreciate if Adam or Randall could add a few more tech bits.
>>> 
>>> --
>>> 
>>> In the meantime, I've put up a BIG FAT WARNING on the CouchDB downloads page:  
>>> 
>>> http://couchdb.apache.org/downloads.html
>>> 
>>> I plan to update the warning with a link to the post-mortem once that is done.
>>> 
>>> --
>>> 
>>> Thanks everybody for being on top of this!
>>> 
>>> Cheers
>>> Jan
>>> -- 
>>> 
>>> 
>>> 
>>>> 
>>>> Thanks,
>>>> Chris
>>>> 
>>>> On Aug 8, 2010, at 5:16 AM, Robert Newson wrote:
>>>> 
>>>>> That was also Adam's conclusion (data loss bug confined to 1.0.0).
>>>>> 
>>>>> B.
>>>>> 
>>>>> On Sun, Aug 8, 2010 at 1:10 PM, Jan Lehnardt <ja...@apache.org> wrote:
>>>>>> 
>>>>>> On 8 Aug 2010, at 13:48, Noah Slater wrote:
>>>>>> 
>>>>>>> Do we need to abort 0.11.2 as well?
>>>>>> 
>>>>>> 0.11.x does not have this commit as far as I can see.
>>>>>> 
>>>>>> Cheers
>>>>>> Jan
>>>>>> --
>>>>>> 
>>>>>>> 
>>>>>>> On 8 Aug 2010, at 11:45, Jan Lehnardt wrote:
>>>>>>> 
>>>>>>>> 
>>>>>>>> On 8 Aug 2010, at 06:35, J Chris Anderson wrote:
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Aug 7, 2010, at 8:45 PM, Dave Cottlehuber wrote:
>>>>>>>>> 
>>>>>>>>>> is this serious enough to justify pulling current 1.0.0 release
>>>>>>>>>> binaries to avoid further installs putting data at risk?
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I'm not sure what Apache policy is about altering a release after the fact. It's probably up to us to decide what to do.
>>>>>>>> 
>>>>>>>> Altering releases is a no-no. The only real procedure is to release a new version and deprecate the old one, while optionally keeping it around for posterity.
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> Probably as soon as 1.0.1 is available we should pull the 1.0.0 release off of the downloads page, etc.
>>>>>>>> 
>>>>>>>> +1.
>>>>>>>> 
>>>>>>>>> I also think we should do a post-mortem blog post announcing the issue and the remedy, as well as digging into how we can prevent this sort of thing in the future.
>>>>>>>>> 
>>>>>>>>> We should make an official announcement before the end of the weekend, with very clear steps to remedy it. (Eg: config delayed_commits to false *without restarting the server* etc)
>>>>>>>> 
>>>>>>>> I think so, too.
>>>>>>>> 
>>>>>>>> Cheers
>>>>>>>> Jan
>>>>>>>> --
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On 8 August 2010 15:08, Randall Leeds <ra...@gmail.com> wrote:
>>>>>>>>>>> Yes. Adam already back ported it.
>>>>>>>>>>> 
>>>>>>>>>>> Sent from my interstellar unicorn.
>>>>>>>>>>> 
>>>>>>>>>>> On Aug 7, 2010 8:03 PM, "Noah Slater" <ns...@apache.org> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Time to abort the vote then?
>>>>>>>>>>> 
>>>>>>>>>>> I'd like to get this fix into 1.0.1 if possible.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On 8 Aug 2010, at 02:28, Damien Katz wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Thanks.
>>>>>>>>>>>> 
>>>>>>>>>>>> Anyone up to create a repair tool for w...
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>> 
>> 
> 


Re: Data loss

Posted by Noah Slater <ns...@apache.org>.
What you are suggesting is an archival of the release, which means removing it from the downloads page, the distribution directory, and the mirrors. I can do this, but I'd like to know that we have consensus first. The plan as I understood it was to archive this release at the same time as making the 1.0.1 release.

On 8 Aug 2010, at 20:21, Robert Dionne wrote:

> I would also consider removing the download link for 1.0.0 and not depend on users patching it. It's broken.
> 
> I have to believe there are users who won't patch it and who won't read the red warning sign. There's a good probability these are the kinds of users who will also be the most upset by data loss.
> 
> 
> 
> 
> On Aug 8, 2010, at 3:06 PM, Jan Lehnardt wrote:
> 
>> 
>> On 8 Aug 2010, at 18:37, J Chris Anderson wrote:
>> 
>>> Devs,
>>> 
>>> I have started a document which we will use when announcing the bug. I plan to move the document from this wiki location to the http://couchdb.apache.org site before the end of the day. Please review and edit the document before then.
>>> 
>>> http://wiki.couchone.com/page/post-mortem
>>> 
>>> I have a section called "The Bug" which needs a technical description of the error and the fix. I'm hoping Adam or Randall can write this, as they are most familiar with the issues.
>>> 
>>> Once it is ready, we should do our best to make sure our users get a chance to read it.
>> 
>> I made a few more minor adjustments (see page history when you are logged in) and have nothing more to add myself, but I'd appreciate if Adam or Randall could add a few more tech bits.
>> 
>> --
>> 
>> In the meantime, I've put up a BIG FAT WARNING on the CouchDB downloads page:  
>> 
>> http://couchdb.apache.org/downloads.html
>> 
>> I plan to update the warning with a link to the post-mortem once that is done.
>> 
>> --
>> 
>> Thanks everybody for being on top of this!
>> 
>> Cheers
>> Jan
>> -- 
>> 
>> 
>> 
>>> 
>>> Thanks,
>>> Chris
>>> 
>>> On Aug 8, 2010, at 5:16 AM, Robert Newson wrote:
>>> 
>>>> That was also Adam's conclusion (data loss bug confined to 1.0.0).
>>>> 
>>>> B.
>>>> 
>>>> On Sun, Aug 8, 2010 at 1:10 PM, Jan Lehnardt <ja...@apache.org> wrote:
>>>>> 
>>>>> On 8 Aug 2010, at 13:48, Noah Slater wrote:
>>>>> 
>>>>>> Do we need to abort 0.11.2 as well?
>>>>> 
>>>>> 0.11.x does not have this commit as far as I can see.
>>>>> 
>>>>> Cheers
>>>>> Jan
>>>>> --
>>>>> 
>>>>>> 
>>>>>> On 8 Aug 2010, at 11:45, Jan Lehnardt wrote:
>>>>>> 
>>>>>>> 
>>>>>>> On 8 Aug 2010, at 06:35, J Chris Anderson wrote:
>>>>>>> 
>>>>>>>> 
>>>>>>>> On Aug 7, 2010, at 8:45 PM, Dave Cottlehuber wrote:
>>>>>>>> 
>>>>>>>>> is this serious enough to justify pulling current 1.0.0 release
>>>>>>>>> binaries to avoid further installs putting data at risk?
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> I'm not sure what Apache policy is about altering a release after the fact. It's probably up to us to decide what to do.
>>>>>>>
>>>>>>> Altering releases is a no-no. The only real procedure is to release a new version and deprecate the old one, while optionally keeping it around for posterity.
>>>>>>> 
>>>>>>> 
>>>>>>>> Probably as soon as 1.0.1 is available we should pull the 1.0.0 release off of the downloads page, etc.
>>>>>>> 
>>>>>>> +1.
>>>>>>> 
>>>>>>>> I also think we should do a post-mortem blog post announcing the issue and the remedy, as well as digging into how we can prevent this sort of thing in the future.
>>>>>>>> 
>>>>>>>> We should make an official announcement before the end of the weekend, with very clear steps to remedy it. (Eg: config delayed_commits to false *without restarting the server* etc)
>>>>>>> 
>>>>>>> I think so, too.
>>>>>>> 
>>>>>>> Cheers
>>>>>>> Jan
>>>>>>> --
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On 8 August 2010 15:08, Randall Leeds <ra...@gmail.com> wrote:
>>>>>>>>>> Yes. Adam already back ported it.
>>>>>>>>>> 
>>>>>>>>>> Sent from my interstellar unicorn.
>>>>>>>>>> 
>>>>>>>>>> On Aug 7, 2010 8:03 PM, "Noah Slater" <ns...@apache.org> wrote:
>>>>>>>>>> 
>>>>>>>>>> Time to abort the vote then?
>>>>>>>>>> 
>>>>>>>>>> I'd like to get this fix into 1.0.1 if possible.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 8 Aug 2010, at 02:28, Damien Katz wrote:
>>>>>>>>>> 
>>>>>>>>>>> Thanks.
>>>>>>>>>>> 
>>>>>>>>>>> Anyone up to create a repair tool for w...
>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>> 
>> 
> 


Re: Data loss

Posted by Robert Dionne <di...@dionne-associates.com>.
I would also consider removing the download link for 1.0.0 and not depend on users patching it. It's broken.

I have to believe there are users who won't patch it and who won't read the red warning sign. There's a good probability these are the kinds of users who will also be the most upset by data loss.




On Aug 8, 2010, at 3:06 PM, Jan Lehnardt wrote:

> 
> On 8 Aug 2010, at 18:37, J Chris Anderson wrote:
> 
>> Devs,
>> 
>> I have started a document which we will use when announcing the bug. I plan to move the document from this wiki location to the http://couchdb.apache.org site before the end of the day. Please review and edit the document before then.
>> 
>> http://wiki.couchone.com/page/post-mortem
>> 
>> I have a section called "The Bug" which needs a technical description of the error and the fix. I'm hoping Adam or Randall can write this, as they are most familiar with the issues.
>> 
>> Once it is ready, we should do our best to make sure our users get a chance to read it.
> 
> I made a few more minor adjustments (see page history when you are logged in) and have nothing more to add myself, but I'd appreciate if Adam or Randall could add a few more tech bits.
> 
> --
> 
> In the meantime, I've put up a BIG FAT WARNING on the CouchDB downloads page:  
> 
>  http://couchdb.apache.org/downloads.html
> 
> I plan to update the warning with a link to the post-mortem once that is done.
> 
> --
> 
> Thanks everybody for being on top of this!
> 
> Cheers
> Jan
> -- 
> 
> 
> 
>> 
>> Thanks,
>> Chris
>> 
>> On Aug 8, 2010, at 5:16 AM, Robert Newson wrote:
>> 
>>> That was also Adam's conclusion (data loss bug confined to 1.0.0).
>>> 
>>> B.
>>> 
>>> On Sun, Aug 8, 2010 at 1:10 PM, Jan Lehnardt <ja...@apache.org> wrote:
>>>> 
>>>> On 8 Aug 2010, at 13:48, Noah Slater wrote:
>>>> 
>>>>> Do we need to abort 0.11.2 as well?
>>>> 
>>>> 0.11.x does not have this commit as far as I can see.
>>>> 
>>>> Cheers
>>>> Jan
>>>> --
>>>> 
>>>>> 
>>>>> On 8 Aug 2010, at 11:45, Jan Lehnardt wrote:
>>>>> 
>>>>>> 
>>>>>> On 8 Aug 2010, at 06:35, J Chris Anderson wrote:
>>>>>> 
>>>>>>> 
>>>>>>> On Aug 7, 2010, at 8:45 PM, Dave Cottlehuber wrote:
>>>>>>> 
>>>>>>>> is this serious enough to justify pulling current 1.0.0 release
>>>>>>>> binaries to avoid further installs putting data at risk?
>>>>>>>> 
>>>>>>> 
>>>>>>> I'm not sure what Apache policy is about altering a release after the fact. It's probably up to us to decide what to do.
>>>>>>
>>>>>> Altering releases is a no-no. The only real procedure is to release a new version and deprecate the old one, while optionally keeping it around for posterity.
>>>>>> 
>>>>>> 
>>>>>>> Probably as soon as 1.0.1 is available we should pull the 1.0.0 release off of the downloads page, etc.
>>>>>> 
>>>>>> +1.
>>>>>> 
>>>>>>> I also think we should do a post-mortem blog post announcing the issue and the remedy, as well as digging into how we can prevent this sort of thing in the future.
>>>>>>> 
>>>>>>> We should make an official announcement before the end of the weekend, with very clear steps to remedy it. (Eg: config delayed_commits to false *without restarting the server* etc)
>>>>>> 
>>>>>> I think so, too.
>>>>>> 
>>>>>> Cheers
>>>>>> Jan
>>>>>> --
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On 8 August 2010 15:08, Randall Leeds <ra...@gmail.com> wrote:
>>>>>>>>> Yes. Adam already back ported it.
>>>>>>>>> 
>>>>>>>>> Sent from my interstellar unicorn.
>>>>>>>>> 
>>>>>>>>> On Aug 7, 2010 8:03 PM, "Noah Slater" <ns...@apache.org> wrote:
>>>>>>>>> 
>>>>>>>>> Time to abort the vote then?
>>>>>>>>> 
>>>>>>>>> I'd like to get this fix into 1.0.1 if possible.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 8 Aug 2010, at 02:28, Damien Katz wrote:
>>>>>>>>> 
>>>>>>>>>> Thanks.
>>>>>>>>>> 
>>>>>>>>>> Anyone up to create a repair tool for w...
>>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>> 
> 


Re: Data loss

Posted by Jan Lehnardt <ja...@apache.org>.
On 8 Aug 2010, at 18:37, J Chris Anderson wrote:

> Devs,
> 
> I have started a document which we will use when announcing the bug. I plan to move the document from this wiki location to the http://couchdb.apache.org site before the end of the day. Please review and edit the document before then.
> 
> http://wiki.couchone.com/page/post-mortem
> 
> I have a section called "The Bug" which needs a technical description of the error and the fix. I'm hoping Adam or Randall can write this, as they are most familiar with the issues.
> 
> Once it is ready, we should do our best to make sure our users get a chance to read it.

I made a few more minor adjustments (see page history when you are logged in) and have nothing more to add myself, but I'd appreciate if Adam or Randall could add a few more tech bits.

--

In the meantime, I've put up a BIG FAT WARNING on the CouchDB downloads page:  

  http://couchdb.apache.org/downloads.html

I plan to update the warning with a link to the post-mortem once that is done.

--

Thanks everybody for being on top of this!

Cheers
Jan
-- 



> 
> Thanks,
> Chris
> 
> On Aug 8, 2010, at 5:16 AM, Robert Newson wrote:
> 
>> That was also Adam's conclusion (data loss bug confined to 1.0.0).
>> 
>> B.
>> 
>> On Sun, Aug 8, 2010 at 1:10 PM, Jan Lehnardt <ja...@apache.org> wrote:
>>> 
>>> On 8 Aug 2010, at 13:48, Noah Slater wrote:
>>> 
>>>> Do we need to abort 0.11.2 as well?
>>> 
>>> 0.11.x does not have this commit as far as I can see.
>>> 
>>> Cheers
>>> Jan
>>> --
>>> 
>>>> 
>>>> On 8 Aug 2010, at 11:45, Jan Lehnardt wrote:
>>>> 
>>>>> 
>>>>> On 8 Aug 2010, at 06:35, J Chris Anderson wrote:
>>>>> 
>>>>>> 
>>>>>> On Aug 7, 2010, at 8:45 PM, Dave Cottlehuber wrote:
>>>>>> 
>>>>>>> is this serious enough to justify pulling current 1.0.0 release
>>>>>>> binaries to avoid further installs putting data at risk?
>>>>>>> 
>>>>>> 
>>>>>> I'm not sure what Apache policy is about altering a release after the fact. It's probably up to us to decide what to do.
>>>>>
>>>>> Altering releases is a no-no. The only real procedure is to release a new version and deprecate the old one, while optionally keeping it around for posterity.
>>>>> 
>>>>> 
>>>>>> Probably as soon as 1.0.1 is available we should pull the 1.0.0 release off of the downloads page, etc.
>>>>> 
>>>>> +1.
>>>>> 
>>>>>> I also think we should do a post-mortem blog post announcing the issue and the remedy, as well as digging into how we can prevent this sort of thing in the future.
>>>>>> 
>>>>>> We should make an official announcement before the end of the weekend, with very clear steps to remedy it. (Eg: config delayed_commits to false *without restarting the server* etc)
>>>>> 
>>>>> I think so, too.
>>>>> 
>>>>> Cheers
>>>>> Jan
>>>>> --
>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On 8 August 2010 15:08, Randall Leeds <ra...@gmail.com> wrote:
>>>>>>>> Yes. Adam already back ported it.
>>>>>>>> 
>>>>>>>> Sent from my interstellar unicorn.
>>>>>>>> 
>>>>>>>> On Aug 7, 2010 8:03 PM, "Noah Slater" <ns...@apache.org> wrote:
>>>>>>>> 
>>>>>>>> Time to abort the vote then?
>>>>>>>> 
>>>>>>>> I'd like to get this fix into 1.0.1 if possible.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 8 Aug 2010, at 02:28, Damien Katz wrote:
>>>>>>>> 
>>>>>>>>> Thanks.
>>>>>>>>> 
>>>>>>>>> Anyone up to create a repair tool for w...
>>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
> 


Re: Data loss

Posted by J Chris Anderson <jc...@gmail.com>.
Devs,

I have started a document which we will use when announcing the bug. I plan to move the document from this wiki location to the http://couchdb.apache.org site before the end of the day. Please review and edit the document before then.

http://wiki.couchone.com/page/post-mortem

I have a section called "The Bug" which needs a technical description of the error and the fix. I'm hoping Adam or Randall can write this, as they are most familiar with the issues.

Once it is ready, we should do our best to make sure our users get a chance to read it.

Thanks,
Chris

On Aug 8, 2010, at 5:16 AM, Robert Newson wrote:

> That was also Adam's conclusion (data loss bug confined to 1.0.0).
> 
> B.
> 
> On Sun, Aug 8, 2010 at 1:10 PM, Jan Lehnardt <ja...@apache.org> wrote:
>> 
>> On 8 Aug 2010, at 13:48, Noah Slater wrote:
>> 
>>> Do we need to abort 0.11.2 as well?
>> 
>> 0.11.x does not have this commit as far as I can see.
>> 
>> Cheers
>> Jan
>> --
>> 
>>> 
>>> On 8 Aug 2010, at 11:45, Jan Lehnardt wrote:
>>> 
>>>> 
>>>> On 8 Aug 2010, at 06:35, J Chris Anderson wrote:
>>>> 
>>>>> 
>>>>> On Aug 7, 2010, at 8:45 PM, Dave Cottlehuber wrote:
>>>>> 
>>>>>> is this serious enough to justify pulling current 1.0.0 release
>>>>>> binaries to avoid further installs putting data at risk?
>>>>>> 
>>>>> 
>>>>> I'm not sure what Apache policy is about altering a release after the fact. It's probably up to us to decide what to do.
>>>>
>>>> Altering releases is a no-no. The only real procedure is to release a new version and deprecate the old one, while optionally keeping it around for posterity.
>>>> 
>>>> 
>>>>> Probably as soon as 1.0.1 is available we should pull the 1.0.0 release off of the downloads page, etc.
>>>> 
>>>> +1.
>>>> 
>>>>> I also think we should do a post-mortem blog post announcing the issue and the remedy, as well as digging into how we can prevent this sort of thing in the future.
>>>>> 
>>>>> We should make an official announcement before the end of the weekend, with very clear steps to remedy it. (Eg: config delayed_commits to false *without restarting the server* etc)
>>>> 
>>>> I think so, too.
>>>> 
>>>> Cheers
>>>> Jan
>>>> --
>>>> 
>>>>> 
>>>>> 
>>>>>> On 8 August 2010 15:08, Randall Leeds <ra...@gmail.com> wrote:
>>>>>>> Yes. Adam already back ported it.
>>>>>>> 
>>>>>>> Sent from my interstellar unicorn.
>>>>>>> 
>>>>>>> On Aug 7, 2010 8:03 PM, "Noah Slater" <ns...@apache.org> wrote:
>>>>>>> 
>>>>>>> Time to abort the vote then?
>>>>>>> 
>>>>>>> I'd like to get this fix into 1.0.1 if possible.
>>>>>>> 
>>>>>>> 
>>>>>>> On 8 Aug 2010, at 02:28, Damien Katz wrote:
>>>>>>> 
>>>>>>>> Thanks.
>>>>>>>> 
>>>>>>>> Anyone up to create a repair tool for w...
>>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>> 


Re: Data loss

Posted by Robert Newson <ro...@gmail.com>.
That was also Adam's conclusion (data loss bug confined to 1.0.0).

B.

On Sun, Aug 8, 2010 at 1:10 PM, Jan Lehnardt <ja...@apache.org> wrote:
>
> On 8 Aug 2010, at 13:48, Noah Slater wrote:
>
>> Do we need to abort 0.11.2 as well?
>
> 0.11.x does not have this commit as far as I can see.
>
> Cheers
> Jan
> --
>
>>
>> On 8 Aug 2010, at 11:45, Jan Lehnardt wrote:
>>
>>>
>>> On 8 Aug 2010, at 06:35, J Chris Anderson wrote:
>>>
>>>>
>>>> On Aug 7, 2010, at 8:45 PM, Dave Cottlehuber wrote:
>>>>
>>>>> is this serious enough to justify pulling current 1.0.0 release
>>>>> binaries to avoid further installs putting data at risk?
>>>>>
>>>>
>>>> I'm not sure what Apache policy is about altering a release after the fact. It's probably up to us to decide what to do.
>>>
>>> Altering releases is a no-no. The only real procedure is to release a new version and deprecate the old one, while optionally keeping it around for posterity.
>>>
>>>
>>>> Probably as soon as 1.0.1 is available we should pull the 1.0.0 release off of the downloads page, etc.
>>>
>>> +1.
>>>
>>>> I also think we should do a post-mortem blog post announcing the issue and the remedy, as well as digging into how we can prevent this sort of thing in the future.
>>>>
>>>> We should make an official announcement before the end of the weekend, with very clear steps to remedy it. (Eg: config delayed_commits to false *without restarting the server* etc)
>>>
>>> I think so, too.
>>>
>>> Cheers
>>> Jan
>>> --
>>>
>>>>
>>>>
>>>>> On 8 August 2010 15:08, Randall Leeds <ra...@gmail.com> wrote:
>>>>>> Yes. Adam already back ported it.
>>>>>>
>>>>>> Sent from my interstellar unicorn.
>>>>>>
>>>>>> On Aug 7, 2010 8:03 PM, "Noah Slater" <ns...@apache.org> wrote:
>>>>>>
>>>>>> Time to abort the vote then?
>>>>>>
>>>>>> I'd like to get this fix into 1.0.1 if possible.
>>>>>>
>>>>>>
>>>>>> On 8 Aug 2010, at 02:28, Damien Katz wrote:
>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> Anyone up to create a repair tool for w...
>>>>>>
>>>>
>>>
>>
>
>

Re: Data loss

Posted by Jan Lehnardt <ja...@apache.org>.
On 8 Aug 2010, at 13:48, Noah Slater wrote:

> Do we need to abort 0.11.2 as well?

0.11.x does not have this commit as far as I can see.

Cheers
Jan
-- 

> 
> On 8 Aug 2010, at 11:45, Jan Lehnardt wrote:
> 
>> 
>> On 8 Aug 2010, at 06:35, J Chris Anderson wrote:
>> 
>>> 
>>> On Aug 7, 2010, at 8:45 PM, Dave Cottlehuber wrote:
>>> 
>>>> is this serious enough to justify pulling current 1.0.0 release
>>>> binaries to avoid further installs putting data at risk?
>>>> 
>>> 
>>> I'm not sure what Apache policy is about altering a release after the fact. It's probably up to us to decide what to do.
>>
>> Altering releases is a no-no. The only real procedure is to release a new version and deprecate the old one, while optionally keeping it around for posterity.
>> 
>> 
>>> Probably as soon as 1.0.1 is available we should pull the 1.0.0 release off of the downloads page, etc.
>> 
>> +1.
>> 
>>> I also think we should do a post-mortem blog post announcing the issue and the remedy, as well as digging into how we can prevent this sort of thing in the future.
>>> 
>>> We should make an official announcement before the end of the weekend, with very clear steps to remedy it. (Eg: config delayed_commits to false *without restarting the server* etc)
>> 
>> I think so, too.
>> 
>> Cheers
>> Jan
>> --
>> 
>>> 
>>> 
>>>> On 8 August 2010 15:08, Randall Leeds <ra...@gmail.com> wrote:
>>>>> Yes. Adam already back ported it.
>>>>> 
>>>>> Sent from my interstellar unicorn.
>>>>> 
>>>>> On Aug 7, 2010 8:03 PM, "Noah Slater" <ns...@apache.org> wrote:
>>>>> 
>>>>> Time to abort the vote then?
>>>>> 
>>>>> I'd like to get this fix into 1.0.1 if possible.
>>>>> 
>>>>> 
>>>>> On 8 Aug 2010, at 02:28, Damien Katz wrote:
>>>>> 
>>>>>> Thanks.
>>>>>> 
>>>>>> Anyone up to create a repair tool for w...
>>>>> 
>>> 
>> 
> 


Re: Data loss

Posted by Noah Slater <ns...@apache.org>.
Do we need to abort 0.11.2 as well?

On 8 Aug 2010, at 11:45, Jan Lehnardt wrote:

> 
> On 8 Aug 2010, at 06:35, J Chris Anderson wrote:
> 
>> 
>> On Aug 7, 2010, at 8:45 PM, Dave Cottlehuber wrote:
>> 
>>> is this serious enough to justify pulling current 1.0.0 release
>>> binaries to avoid further installs putting data at risk?
>>> 
>> 
>> I'm not sure what Apache policy is about altering a release after the fact. It's probably up to us to decide what to do.
>
> Altering releases is a no-no. The only real procedure is to release a new version and deprecate the old one, while optionally keeping it around for posterity.
> 
> 
>> Probably as soon as 1.0.1 is available we should pull the 1.0.0 release off of the downloads page, etc.
> 
> +1.
> 
>> I also think we should do a post-mortem blog post announcing the issue and the remedy, as well as digging into how we can prevent this sort of thing in the future.
>> 
>> We should make an official announcement before the end of the weekend, with very clear steps to remedy it. (Eg: config delayed_commits to false *without restarting the server* etc)
> 
> I think so, too.
> 
> Cheers
> Jan
> --
> 
>> 
>> 
>>> On 8 August 2010 15:08, Randall Leeds <ra...@gmail.com> wrote:
>>>> Yes. Adam already back ported it.
>>>> 
>>>> Sent from my interstellar unicorn.
>>>> 
>>>> On Aug 7, 2010 8:03 PM, "Noah Slater" <ns...@apache.org> wrote:
>>>> 
>>>> Time to abort the vote then?
>>>> 
>>>> I'd like to get this fix into 1.0.1 if possible.
>>>> 
>>>> 
>>>> On 8 Aug 2010, at 02:28, Damien Katz wrote:
>>>> 
>>>>> Thanks.
>>>>> 
>>>>> Anyone up to create a repair tool for w...
>>>> 
>> 
> 


Re: Data loss

Posted by Jan Lehnardt <ja...@apache.org>.
On 8 Aug 2010, at 06:35, J Chris Anderson wrote:

> 
> On Aug 7, 2010, at 8:45 PM, Dave Cottlehuber wrote:
> 
>> is this serious enough to justify pulling current 1.0.0 release
>> binaries to avoid further installs putting data at risk?
>> 
> 
> I'm not sure what Apache policy is about altering a release after the fact. It's probably up to us to decide what to do.

Altering releases is a no-no. The only real procedure is to release a new version and deprecate the old one, while optionally keeping it around for posterity.


> Probably as soon as 1.0.1 is available we should pull the 1.0.0 release off of the downloads page, etc.

+1.

> I also think we should do a post-mortem blog post announcing the issue and the remedy, as well as digging into how we can prevent this sort of thing in the future.
> 
> We should make an official announcement before the end of the weekend, with very clear steps to remedy it. (Eg: config delayed_commits to false *without restarting the server* etc)

I think so, too.

Cheers
Jan
--

> 
> 
>> On 8 August 2010 15:08, Randall Leeds <ra...@gmail.com> wrote:
>>> Yes. Adam already back ported it.
>>> 
>>> Sent from my interstellar unicorn.
>>> 
>>> On Aug 7, 2010 8:03 PM, "Noah Slater" <ns...@apache.org> wrote:
>>> 
>>> Time to abort the vote then?
>>> 
>>> I'd like to get this fix into 1.0.1 if possible.
>>> 
>>> 
>>> On 8 Aug 2010, at 02:28, Damien Katz wrote:
>>> 
>>>> Thanks.
>>>> 
>>>> Anyone up to create a repair tool for w...
>>> 
> 


Re: Data loss

Posted by J Chris Anderson <jc...@apache.org>.
On Aug 7, 2010, at 8:45 PM, Dave Cottlehuber wrote:

> is this serious enough to justify pulling current 1.0.0 release
> binaries to avoid further installs putting data at risk?
> 

I'm not sure what Apache policy is about altering a release after the fact. It's probably up to us to decide what to do.

Probably as soon as 1.0.1 is available we should pull the 1.0.0 release off of the downloads page, etc.

I also think we should do a post-mortem blog post announcing the issue and the remedy, as well as digging into how we can prevent this sort of thing in the future.

We should make an official announcement before the end of the weekend, with very clear steps to remedy it. (Eg: config delayed_commits to false *without restarting the server* etc)

Chris
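For anyone wanting to apply that remedy right away: the delayed_commits setting can be flipped through CouchDB's runtime configuration API, so no restart is needed. A hedged sketch, assuming a stock 1.x install on the default port with admin access; "mydb" is a placeholder for each of your databases:

```shell
# Turn delayed commits off at runtime -- no server restart needed.
curl -X PUT http://127.0.0.1:5984/_config/couchdb/delayed_commits \
     -H 'Content-Type: application/json' \
     -d '"false"'

# Then flush any buffered updates to disk, once per database.
curl -X POST http://127.0.0.1:5984/mydb/_ensure_full_commit \
     -H 'Content-Type: application/json'
```

Note that, as Damien points out earlier in the thread, _ensure_full_commit is a no-op if the server believes there is nothing to commit, so disabling delayed_commits first is the safer order.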

> On 8 August 2010 15:08, Randall Leeds <ra...@gmail.com> wrote:
>> Yes. Adam already back ported it.
>> 
>> Sent from my interstellar unicorn.
>> 
>> On Aug 7, 2010 8:03 PM, "Noah Slater" <ns...@apache.org> wrote:
>> 
>> Time to abort the vote then?
>> 
>> I'd like to get this fix into 1.0.1 if possible.
>> 
>> 
>> On 8 Aug 2010, at 02:28, Damien Katz wrote:
>> 
>>> Thanks.
>>> 
>>> Anyone up to create a repair tool for w...
>> 


Re: Data loss

Posted by Dave Cottlehuber <da...@muse.net.nz>.
is this serious enough to justify pulling current 1.0.0 release
binaries to avoid further installs putting data at risk?

On 8 August 2010 15:08, Randall Leeds <ra...@gmail.com> wrote:
> Yes. Adam already back ported it.
>
> Sent from my interstellar unicorn.
>
> On Aug 7, 2010 8:03 PM, "Noah Slater" <ns...@apache.org> wrote:
>
> Time to abort the vote then?
>
> I'd like to get this fix into 1.0.1 if possible.
>
>
> On 8 Aug 2010, at 02:28, Damien Katz wrote:
>
>> Thanks.
>>
>> Anyone up to create a repair tool for w...
>

Re: Data loss

Posted by Randall Leeds <ra...@gmail.com>.
Yes. Adam already back ported it.

Sent from my interstellar unicorn.

On Aug 7, 2010 8:03 PM, "Noah Slater" <ns...@apache.org> wrote:

Time to abort the vote then?

I'd like to get this fix into 1.0.1 if possible.


On 8 Aug 2010, at 02:28, Damien Katz wrote:

> Thanks.
>
> Anyone up to create a repair tool for w...

Re: Data loss

Posted by Noah Slater <ns...@apache.org>.
Time to abort the vote then?

I'd like to get this fix into 1.0.1 if possible.

On 8 Aug 2010, at 02:28, Damien Katz wrote:

> Thanks.
> 
> Anyone up to create a repair tool for when this happens? It should be possible to find the previous header, then find the most recent btree roots, find the high seq and apply them to the header and commit. I'm thinking this would be a one time server upgrade script.
> 
> -Damien
> 
> 
> On Aug 7, 2010, at 5:47 PM, Adam Kocoloski wrote:
> 
>> Committed to trunk and 1.0.x.
>> 
>> On Aug 7, 2010, at 8:33 PM, Randall Leeds wrote:
>> 
>>> http://github.com/tilgovi/couchdb/tree/fixlostcommits
>>> 
>>> Test and fix in separate commits at the end of that branch, based off
>>> current trunk.
>>> Would appreciate verification that the test is initially broken but
>>> fixed by the patch.
>>> 
>>> On Sat, Aug 7, 2010 at 17:16, Damien Katz <da...@apache.org> wrote:
>>>> I reproduced this manually:
>>>> 
>>>> Create document with id "x", ensure full commit (simply wait longer than 1 sec, say 2 secs).
>>>> 
>>>> Attempt to create document "x" again, get conflict error.
>>>> 
>>>> Wait at least 2 secs to ensure the delayed commit attempt happens.
>>>> 
>>>> Now create document "y".
>>>> 
>>>> Wait at least 2 secs because the delayed commit should happen.
>>>> 
>>>> Restart server.
>>>> 
>>>> Document "y" is now missing.
>>>> 
>>>> The last delayed commit isn't happening. From then on out, no docs updated with a delayed commit will be available after a restart.
>>>> 
>>>> -Damien
>>>> 
>>>> On Aug 7, 2010, at 4:58 PM, Adam Kocoloski wrote:
>>>> 
>>>>> I believe it's a single delayed conflict write attempt and no successes in that same interval.
>>>>> 
>>>>> On Aug 7, 2010, at 7:51 PM, Damien Katz wrote:
>>>>> 
>>>>>> Looks like all that's necessary is a single delayed conflict write attempt, and all subsequent delayed commits won't be committed; the header never gets written.
>>>>>> 
>>>>>> 1.0 loses data. This is ridiculously bad.
>>>>>> 
>>>>>> We need a test to reproduce this and fix.
>>>>>> 
>>>>>> -Damien
>>>>>> 
>>>>>> On Aug 7, 2010, at 4:35 PM, Adam Kocoloski wrote:
>>>>>> 
>>>>>>> Good sleuthing guys, and my apologies for letting this through.  Randall, your patch in COUCHDB-794 was actually fine, it was my reworking of it that caused this serious bug.
>>>>>>> 
>>>>>>> With respect to that gist 513282, I think it would be better to return Db#db{waiting_delayed_commit=nil} when the headers match instead of moving the cancel_timer() command as you did.  After all, we did perform the check here -- it was just that nothing needed to be committed.
>>>>>>> 
>>>>>>> Adam
>>>>>>> 
>>>>>>> On Aug 7, 2010, at 6:55 PM, Damien Katz wrote:
>>>>>>> 
>>>>>>>> Yes, I think it requires 2 conflicting writes in a row, because it needs to trigger the delayed_commit timer without actually having anything to commit, so the header never changes.
>>>>>>>> 
>>>>>>>> Try to reproduce this and add a test case.
>>>>>>>> 
>>>>>>>> -Damien
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Aug 7, 2010, at 3:47 PM, Randall Leeds wrote:
>>>>>>>> 
>>>>>>>>> I think you may be right, Damien.
>>>>>>>>> If ever a write happens that only contains conflicts while waiting for
>>>>>>>>> a delayed commit message we might still be cancelling the timer. Is
>>>>>>>>> this what you're thinking? This would be the fix:
>>>>>>>>> http://gist.github.com/513282
>>>>>>>>> 
>>>>>>>>> On Sat, Aug 7, 2010 at 15:42, Damien Katz <da...@apache.org> wrote:
>>>>>>>>>> I think the problem might be that 2 conflicting write attempts in a row can leave the #db.waiting_delayed_commit set but the timer has been cancelled. Once that happens, the header may never be written, as it always thinks a delayed commit will fire soon.
>>>>>>>>>> 
>>>>>>>>>> -Damien
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote:
>>>>>>>>>> 
>>>>>>>>>>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds <ra...@gmail.com> wrote:
>>>>>>>>>>>> I agree completely! I immediately thought of this because I wrote that
>>>>>>>>>>>> change. I spent a while staring at it last night but still can't
>>>>>>>>>>>> imagine how it's a problem.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <da...@apache.org> wrote:
>>>>>>>>>>>>> SVN commit r954043 looks suspicious. Digging further.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -Damien
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> I still want to stare at r954043, but it looks to me like there's at
>>>>>>>>>>> least one situation where we do not commit data correctly during
>>>>>>>>>>> compaction. This has to do with the way we now use the path to sync
>>>>>>>>>>> outside the couch_file:process. Check this diff:
>>>>>>>>>>> http://gist.github.com/513081
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>> 
> 
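The failure mode pinned down in the quoted exchange — a conflicts-only write cancels the delayed-commit timer while leaving #db.waiting_delayed_commit set, so the commit never fires — can be modeled in a few lines. This is a simplified Python sketch of that state machine, not CouchDB's actual Erlang code; all names are illustrative:

```python
# Toy model of the delayed-commit bug discussed in this thread.
# `waiting_delayed_commit` and `timer_armed` mirror the #db record
# fields and the Erlang timer; everything else is illustrative.
class Db:
    def __init__(self):
        self.waiting_delayed_commit = False
        self.timer_armed = False
        self.committed = []
        self.pending = []

    def write(self, doc, conflict=False):
        if conflict:
            # Buggy path: nothing was written, yet the timer is
            # cancelled while waiting_delayed_commit stays set.
            self.timer_armed = False
            self.waiting_delayed_commit = True
            return
        self.pending.append(doc)
        if not self.waiting_delayed_commit:
            self.waiting_delayed_commit = True
            self.timer_armed = True  # schedule the delayed commit

    def timer_fires(self):
        # The delayed commit only happens if the timer is still armed.
        if self.timer_armed:
            self.committed += self.pending
            self.pending = []
            self.waiting_delayed_commit = False
            self.timer_armed = False

db = Db()
db.write("x")                  # schedules a delayed commit
db.timer_fires()               # "x" reaches disk
db.write("x", conflict=True)   # conflicts-only write cancels the timer
db.write("y")                  # sees waiting_delayed_commit, never re-arms
db.timer_fires()               # no-op: the timer was cancelled
print(db.committed)            # → ['x']  ("y" would vanish on restart)
```

The fix discussed above amounts to clearing waiting_delayed_commit on the nothing-to-commit path, so the next real write schedules a commit again.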


Re: Data loss

Posted by Damien Katz <da...@apache.org>.
Thanks.

Anyone up to create a repair tool for when this happens? It should be possible to find the previous header, then find the most recent btree roots, find the high seq and apply them to the header and commit. I'm thinking this would be a one time server upgrade script.

-Damien
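A rough sketch of the backward header scan Damien describes. It assumes the couch_file on-disk layout (the file is divided into 4096-byte blocks, and a block whose first byte is 0x01 begins a serialized header); the function name and the synthetic file are illustrative, and a real repair tool would also need to verify and deserialize the header it finds before committing it:

```python
# Scan backwards for the last block that begins a header, per the
# assumed couch_file layout (4096-byte blocks, header blocks start
# with byte 0x01). Names here are illustrative, not CouchDB's own.
BLOCK_SIZE = 4096

def find_previous_header(data, block_size=BLOCK_SIZE):
    """Return the offset of the newest header block, or None."""
    last_block = (len(data) - 1) // block_size if data else -1
    for block in range(last_block, -1, -1):
        offset = block * block_size
        if data[offset:offset + 1] == b"\x01":
            return offset
    return None

# Synthetic demo: two header blocks plus trailing uncommitted data.
fake = bytearray(b"\x00" * (BLOCK_SIZE * 3))
fake[0] = 0x01                  # oldest header at block 0
fake[BLOCK_SIZE] = 0x01         # newer header at block 1
fake += b"orphaned updates"     # data written after the last header
print(find_previous_header(bytes(fake)))  # → 4096
```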


On Aug 7, 2010, at 5:47 PM, Adam Kocoloski wrote:

> Committed to trunk and 1.0.x.
> 
> On Aug 7, 2010, at 8:33 PM, Randall Leeds wrote:
> 
>> http://github.com/tilgovi/couchdb/tree/fixlostcommits
>> 
>> Test and fix in separate commits at the end of that branch, based off
>> current trunk.
>> Would appreciate verification that the test is initially broken but
>> fixed by the patch.
>> 
>> On Sat, Aug 7, 2010 at 17:16, Damien Katz <da...@apache.org> wrote:
>>> I reproduced this manually:
>>> 
>>> Create document with id "x", ensure full commit (simply wait longer than 1 sec, say 2 secs).
>>> 
>>> Attempt to create document "x" again, get conflict error.
>>> 
>>> Wait at least 2 secs to ensure the delayed commit attempt happens.
>>> 
>>> Now create document "y".
>>> 
>>> Wait at least 2 secs because the delayed commit should happen.
>>> 
>>> Restart server.
>>> 
>>> Document "y" is now missing.
>>> 
>>> The last delayed commit isn't happening. From then on out, no docs updated with a delayed commit will be available after a restart.
>>> 
>>> -Damien
>>> 
>>> On Aug 7, 2010, at 4:58 PM, Adam Kocoloski wrote:
>>> 
>>>> I believe it's a single delayed conflict write attempt and no successes in that same interval.
>>>> 
>>>> On Aug 7, 2010, at 7:51 PM, Damien Katz wrote:
>>>> 
>>>>> Looks like all that's necessary is a single delayed conflict write attempt, and all subsequent delayed commits won't be committed; the header never gets written.
>>>>> 
>>>>> 1.0 loses data. This is ridiculously bad.
>>>>> 
>>>>> We need a test to reproduce this and fix.
>>>>> 
>>>>> -Damien
>>>>> 
>>>>> On Aug 7, 2010, at 4:35 PM, Adam Kocoloski wrote:
>>>>> 
>>>>>> Good sleuthing guys, and my apologies for letting this through.  Randall, your patch in COUCHDB-794 was actually fine, it was my reworking of it that caused this serious bug.
>>>>>> 
>>>>>> With respect to that gist 513282, I think it would be better to return Db#db{waiting_delayed_commit=nil} when the headers match instead of moving the cancel_timer() command as you did.  After all, we did perform the check here -- it was just that nothing needed to be committed.
>>>>>> 
>>>>>> Adam
>>>>>> 
>>>>>> On Aug 7, 2010, at 6:55 PM, Damien Katz wrote:
>>>>>> 
>>>>>>> Yes, I think it requires 2 conflicting writes in a row, because it needs to trigger the delayed_commit timer without actually having anything to commit, so the header never changes.
>>>>>>> 
>>>>>>> Try to reproduce this and add a test case.
>>>>>>> 
>>>>>>> -Damien
>>>>>>> 
>>>>>>> 
>>>>>>> On Aug 7, 2010, at 3:47 PM, Randall Leeds wrote:
>>>>>>> 
>>>>>>>> I think you may be right, Damien.
>>>>>>>> If ever a write happens that only contains conflicts while waiting for
>>>>>>>> a delayed commit message we might still be cancelling the timer. Is
>>>>>>>> this what you're thinking? This would be the fix:
>>>>>>>> http://gist.github.com/513282
>>>>>>>> 
>>>>>>>> On Sat, Aug 7, 2010 at 15:42, Damien Katz <da...@apache.org> wrote:
>>>>>>>>> I think the problem might be that 2 conflicting write attempts in a row can leave the #db.waiting_delayed_commit set but the timer has been cancelled. Once that happens, the header may never be written, as it always thinks a delayed commit will fire soon.
>>>>>>>>> 
>>>>>>>>> -Damien
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote:
>>>>>>>>> 
>>>>>>>>>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds <ra...@gmail.com> wrote:
>>>>>>>>>>> I agree completely! I immediately thought of this because I wrote that
>>>>>>>>>>> change. I spent a while staring at it last night but still can't
>>>>>>>>>>> imagine how it's a problem.
>>>>>>>>>>> 
>>>>>>>>>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <da...@apache.org> wrote:
>>>>>>>>>>>> SVN commit r954043 looks suspicious. Digging further.
>>>>>>>>>>>> 
>>>>>>>>>>>> -Damien
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> I still want to stare at r954043, but it looks to me like there's at
>>>>>>>>>> least one situation where we do not commit data correctly during
>>>>>>>>>> compaction. This has to do with the way we now use the path to sync
>>>>>>>>>> outside the couch_file:process. Check this diff:
>>>>>>>>>> http://gist.github.com/513081
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
> 


Re: Data loss

Posted by Adam Kocoloski <ko...@apache.org>.
Committed to trunk and 1.0.x.

On Aug 7, 2010, at 8:33 PM, Randall Leeds wrote:

> http://github.com/tilgovi/couchdb/tree/fixlostcommits
> 
> Test and fix in separate commits at the end of that branch, based off
> current trunk.
> Would appreciate verification that the test is initially broken but
> fixed by the patch.
> 
> On Sat, Aug 7, 2010 at 17:16, Damien Katz <da...@apache.org> wrote:
>> I reproduced this manually:
>> 
>> Create document with id "x", ensure full commit (simply wait longer than 1 sec, say 2 secs).
>> 
>> Attempt to create document "x" again, get conflict error.
>> 
>> Wait at least 2 secs to ensure the delayed commit attempt happens.
>> 
>> Now create document "y".
>> 
>> Wait at least 2 secs because the delayed commit should happen
>> 
>> Restart server.
>> 
>> Document "y" is now missing.
>> 
>> The last delayed commit isn't happening. From then on out, no docs updated with delayed commit will be available after a restart.
>> 
>> -Damien
>> 
>> On Aug 7, 2010, at 4:58 PM, Adam Kocoloski wrote:
>> 
>>> I believe it's a single delayed conflict write attempt and no successes in that same interval.
>>> 
>>> On Aug 7, 2010, at 7:51 PM, Damien Katz wrote:
>>> 
>>>> Looks like all that's necessary is a single delayed conflict write attempt, and all subsequent delayed commits won't be committed; the header never gets written.
>>>> 
>>>> 1.0 loses data. This is ridiculously bad.
>>>> 
>>>> We need a test to reproduce this and fix.
>>>> 
>>>> -Damien
>>>> 
>>>> On Aug 7, 2010, at 4:35 PM, Adam Kocoloski wrote:
>>>> 
>>>>> Good sleuthing guys, and my apologies for letting this through.  Randall, your patch in COUCHDB-794 was actually fine, it was my reworking of it that caused this serious bug.
>>>>> 
>>>>> With respect to that gist 513282, I think it would be better to return Db#db{waiting_delayed_commit=nil} when the headers match instead of moving the cancel_timer() command as you did.  After all, we did perform the check here -- it was just that nothing needed to be committed.
>>>>> 
>>>>> Adam
>>>>> 
>>>>> On Aug 7, 2010, at 6:55 PM, Damien Katz wrote:
>>>>> 
>>>>>> Yes, I think it requires 2 conflicting writes in a row, because it needs to trigger the delayed_commit timer without actually having anything to commit, so the header never changes.
>>>>>> 
>>>>>> Try to reproduce this and add a test case.
>>>>>> 
>>>>>> -Damien
>>>>>> 
>>>>>> 
>>>>>> On Aug 7, 2010, at 3:47 PM, Randall Leeds wrote:
>>>>>> 
>>>>>>> I think you may be right, Damien.
>>>>>>> If ever a write happens that only contains conflicts while waiting for
>>>>>>> a delayed commit message we might still be cancelling the timer. Is
>>>>>>> this what you're thinking? This would be the fix:
>>>>>>> http://gist.github.com/513282
>>>>>>> 
>>>>>>> On Sat, Aug 7, 2010 at 15:42, Damien Katz <da...@apache.org> wrote:
>>>>>>>> I think the problem might be that 2 conflicting write attempts in a row can leave the #db.waiting_delayed_commit set but the timer has been cancelled. Once that happens, the header may never be written, as it always thinks a delayed commit will fire soon.
>>>>>>>> 
>>>>>>>> -Damien
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote:
>>>>>>>> 
>>>>>>>>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds <ra...@gmail.com> wrote:
>>>>>>>>>> I agree completely! I immediately thought of this because I wrote that
>>>>>>>>>> change. I spent a while staring at it last night but still can't
>>>>>>>>>> imagine how it's a problem.
>>>>>>>>>> 
>>>>>>>>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <da...@apache.org> wrote:
>>>>>>>>>>> SVN commit r954043 looks suspicious. Digging further.
>>>>>>>>>>> 
>>>>>>>>>>> -Damien
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I still want to stare at r954043, but it looks to me like there's at
>>>>>>>>> least one situation where we do not commit data correctly during
>>>>>>>>> compaction. This has to do with the way we now use the path to sync
>>>>>>>>> outside the couch_file:process. Check this diff:
>>>>>>>>> http://gist.github.com/513081
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>> 


Re: Data loss

Posted by Eric Carlson <er...@ericcarlson.co.uk>.
Just tried it, and I can confirm that the new futon test fails without
the patch but succeeds with it. Also, Damien's manual method of
reproducing the problem loses data without the patch but everything
seems to work correctly with the patch.

-Eric

On 08/08/10 01:33, Randall Leeds wrote:
> http://github.com/tilgovi/couchdb/tree/fixlostcommits
>
> Test and fix in separate commits at the end of that branch, based off
> current trunk.
> Would appreciate verification that the test is initially broken but
> fixed by the patch.
>
> On Sat, Aug 7, 2010 at 17:16, Damien Katz <da...@apache.org> wrote:
>> I reproduced this manually:
>>
>> Create document with id "x", ensure full commit (simply wait longer than 1 sec, say 2 secs).
>>
>> Attempt to create document "x" again, get conflict error.
>>
>> Wait at least 2 secs to ensure the delayed commit attempt happens.
>>
>> Now create document "y".
>>
>> Wait at least 2 secs because the delayed commit should happen
>>
>> Restart server.
>>
>> Document "y" is now missing.
>>
>> The last delayed commit isn't happening. From then on out, no docs updated with delayed commit will be available after a restart.
>>
>> -Damien
>>
>> On Aug 7, 2010, at 4:58 PM, Adam Kocoloski wrote:
>>
>>> I believe it's a single delayed conflict write attempt and no successes in that same interval.
>>>
>>> On Aug 7, 2010, at 7:51 PM, Damien Katz wrote:
>>>
>>>> Looks like all that's necessary is a single delayed conflict write attempt, and all subsequent delayed commits won't be committed; the header never gets written.
>>>>
>>>> 1.0 loses data. This is ridiculously bad.
>>>>
>>>> We need a test to reproduce this and fix.
>>>>
>>>> -Damien
>>>>
>>>> On Aug 7, 2010, at 4:35 PM, Adam Kocoloski wrote:
>>>>
>>>>> Good sleuthing guys, and my apologies for letting this through.  Randall, your patch in COUCHDB-794 was actually fine, it was my reworking of it that caused this serious bug.
>>>>>
>>>>> With respect to that gist 513282, I think it would be better to return Db#db{waiting_delayed_commit=nil} when the headers match instead of moving the cancel_timer() command as you did.  After all, we did perform the check here -- it was just that nothing needed to be committed.
>>>>>
>>>>> Adam
>>>>>
>>>>> On Aug 7, 2010, at 6:55 PM, Damien Katz wrote:
>>>>>
>>>>>> Yes, I think it requires 2 conflicting writes in a row, because it needs to trigger the delayed_commit timer without actually having anything to commit, so the header never changes.
>>>>>>
>>>>>> Try to reproduce this and add a test case.
>>>>>>
>>>>>> -Damien
>>>>>>
>>>>>>
>>>>>> On Aug 7, 2010, at 3:47 PM, Randall Leeds wrote:
>>>>>>
>>>>>>> I think you may be right, Damien.
>>>>>>> If ever a write happens that only contains conflicts while waiting for
>>>>>>> a delayed commit message we might still be cancelling the timer. Is
>>>>>>> this what you're thinking? This would be the fix:
>>>>>>> http://gist.github.com/513282
>>>>>>>
>>>>>>> On Sat, Aug 7, 2010 at 15:42, Damien Katz <da...@apache.org> wrote:
>>>>>>>> I think the problem might be that 2 conflicting write attempts in a row can leave the #db.waiting_delayed_commit set but the timer has been cancelled. Once that happens, the header may never be written, as it always thinks a delayed commit will fire soon.
>>>>>>>>
>>>>>>>> -Damien
>>>>>>>>
>>>>>>>>
>>>>>>>> On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote:
>>>>>>>>
>>>>>>>>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds <ra...@gmail.com> wrote:
>>>>>>>>>> I agree completely! I immediately thought of this because I wrote that
>>>>>>>>>> change. I spent a while staring at it last night but still can't
>>>>>>>>>> imagine how it's a problem.
>>>>>>>>>>
>>>>>>>>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <da...@apache.org> wrote:
>>>>>>>>>>> SVN commit r954043 looks suspicious. Digging further.
>>>>>>>>>>>
>>>>>>>>>>> -Damien
>>>>>>>>> I still want to stare at r954043, but it looks to me like there's at
>>>>>>>>> least one situation where we do not commit data correctly during
>>>>>>>>> compaction. This has to do with the way we now use the path to sync
>>>>>>>>> outside the couch_file:process. Check this diff:
>>>>>>>>> http://gist.github.com/513081
>>>>>>>>
>>

Re: Data loss

Posted by Randall Leeds <ra...@gmail.com>.
http://github.com/tilgovi/couchdb/tree/fixlostcommits

Test and fix in separate commits at the end of that branch, based off
current trunk.
Would appreciate verification that the test is initially broken but
fixed by the patch.

On Sat, Aug 7, 2010 at 17:16, Damien Katz <da...@apache.org> wrote:
> I reproduced this manually:
>
> Create document with id "x", ensure full commit (simply wait longer than 1 sec, say 2 secs).
>
> Attempt to create document "x" again, get conflict error.
>
> Wait at least 2 secs to ensure the delayed commit attempt happens.
>
> Now create document "y".
>
> Wait at least 2 secs because the delayed commit should happen
>
> Restart server.
>
> Document "y" is now missing.
>
> The last delayed commit isn't happening. From then on out, no docs updated with delayed commit will be available after a restart.
>
> -Damien
>
> On Aug 7, 2010, at 4:58 PM, Adam Kocoloski wrote:
>
>> I believe it's a single delayed conflict write attempt and no successes in that same interval.
>>
>> On Aug 7, 2010, at 7:51 PM, Damien Katz wrote:
>>
>>> Looks like all that's necessary is a single delayed conflict write attempt, and all subsequent delayed commits won't be committed; the header never gets written.
>>>
>>> 1.0 loses data. This is ridiculously bad.
>>>
>>> We need a test to reproduce this and fix.
>>>
>>> -Damien
>>>
>>> On Aug 7, 2010, at 4:35 PM, Adam Kocoloski wrote:
>>>
>>>> Good sleuthing guys, and my apologies for letting this through.  Randall, your patch in COUCHDB-794 was actually fine, it was my reworking of it that caused this serious bug.
>>>>
>>>> With respect to that gist 513282, I think it would be better to return Db#db{waiting_delayed_commit=nil} when the headers match instead of moving the cancel_timer() command as you did.  After all, we did perform the check here -- it was just that nothing needed to be committed.
>>>>
>>>> Adam
>>>>
>>>> On Aug 7, 2010, at 6:55 PM, Damien Katz wrote:
>>>>
>>>>> Yes, I think it requires 2 conflicting writes in a row, because it needs to trigger the delayed_commit timer without actually having anything to commit, so the header never changes.
>>>>>
>>>>> Try to reproduce this and add a test case.
>>>>>
>>>>> -Damien
>>>>>
>>>>>
>>>>> On Aug 7, 2010, at 3:47 PM, Randall Leeds wrote:
>>>>>
>>>>>> I think you may be right, Damien.
>>>>>> If ever a write happens that only contains conflicts while waiting for
>>>>>> a delayed commit message we might still be cancelling the timer. Is
>>>>>> this what you're thinking? This would be the fix:
>>>>>> http://gist.github.com/513282
>>>>>>
>>>>>> On Sat, Aug 7, 2010 at 15:42, Damien Katz <da...@apache.org> wrote:
>>>>>>> I think the problem might be that 2 conflicting write attempts in a row can leave the #db.waiting_delayed_commit set but the timer has been cancelled. Once that happens, the header may never be written, as it always thinks a delayed commit will fire soon.
>>>>>>>
>>>>>>> -Damien
>>>>>>>
>>>>>>>
>>>>>>> On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote:
>>>>>>>
>>>>>>>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds <ra...@gmail.com> wrote:
>>>>>>>>> I agree completely! I immediately thought of this because I wrote that
>>>>>>>>> change. I spent a while staring at it last night but still can't
>>>>>>>>> imagine how it's a problem.
>>>>>>>>>
>>>>>>>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <da...@apache.org> wrote:
>>>>>>>>>> SVN commit r954043 looks suspicious. Digging further.
>>>>>>>>>>
>>>>>>>>>> -Damien
>>>>>>>>>
>>>>>>>>
>>>>>>>> I still want to stare at r954043, but it looks to me like there's at
>>>>>>>> least one situation where we do not commit data correctly during
>>>>>>>> compaction. This has to do with the way we now use the path to sync
>>>>>>>> outside the couch_file:process. Check this diff:
>>>>>>>> http://gist.github.com/513081
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>
>
>

Re: Data loss

Posted by Randall Leeds <ra...@gmail.com>.
http://github.com/tilgovi/couchdb/tree/fixlostcommits

Test and fix in separate commits at the end of that branch, based off
current trunk.
Would appreciate verification that the test is initially broken but
fixed by the patch.

On Sat, Aug 7, 2010 at 17:16, Damien Katz <da...@apache.org> wrote:
> I reproduced this manually:
>
> Create document with id "x", ensure full commit (simply wait longer than 1 sec, say 2 secs).
>
> Attempt to create document "x" again, get conflict error.
>
> Wait at least 2 secs to ensure the delayed commit attempt happens.
>
> Now create document "y".
>
> Wait at least 2 secs because the delayed commit should happen
>
> Restart server.
>
> Document "y" is now missing.
>
> The last delayed commit isn't happening. From then on out, no docs updated with delayed commit will be available after a restart.
>
> -Damien
>
> On Aug 7, 2010, at 4:58 PM, Adam Kocoloski wrote:
>
>> I believe it's a single delayed conflict write attempt and no successes in that same interval.
>>
>> On Aug 7, 2010, at 7:51 PM, Damien Katz wrote:
>>
>>> Looks like all that's necessary is a single delayed conflict write attempt, and all subsequent delayed commits won't be committed; the header never gets written.
>>>
>>> 1.0 loses data. This is ridiculously bad.
>>>
>>> We need a test to reproduce this and fix.
>>>
>>> -Damien
>>>
>>> On Aug 7, 2010, at 4:35 PM, Adam Kocoloski wrote:
>>>
>>>> Good sleuthing guys, and my apologies for letting this through.  Randall, your patch in COUCHDB-794 was actually fine, it was my reworking of it that caused this serious bug.
>>>>
>>>> With respect to that gist 513282, I think it would be better to return Db#db{waiting_delayed_commit=nil} when the headers match instead of moving the cancel_timer() command as you did.  After all, we did perform the check here -- it was just that nothing needed to be committed.
>>>>
>>>> Adam
>>>>
>>>> On Aug 7, 2010, at 6:55 PM, Damien Katz wrote:
>>>>
>>>>> Yes, I think it requires 2 conflicting writes in a row, because it needs to trigger the delayed_commit timer without actually having anything to commit, so the header never changes.
>>>>>
>>>>> Try to reproduce this and add a test case.
>>>>>
>>>>> -Damien
>>>>>
>>>>>
>>>>> On Aug 7, 2010, at 3:47 PM, Randall Leeds wrote:
>>>>>
>>>>>> I think you may be right, Damien.
>>>>>> If ever a write happens that only contains conflicts while waiting for
>>>>>> a delayed commit message we might still be cancelling the timer. Is
>>>>>> this what you're thinking? This would be the fix:
>>>>>> http://gist.github.com/513282
>>>>>>
>>>>>> On Sat, Aug 7, 2010 at 15:42, Damien Katz <da...@apache.org> wrote:
>>>>>>> I think the problem might be that 2 conflicting write attempts in a row can leave the #db.waiting_delayed_commit set but the timer has been cancelled. Once that happens, the header may never be written, as it always thinks a delayed commit will fire soon.
>>>>>>>
>>>>>>> -Damien
>>>>>>>
>>>>>>>
>>>>>>> On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote:
>>>>>>>
>>>>>>>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds <ra...@gmail.com> wrote:
>>>>>>>>> I agree completely! I immediately thought of this because I wrote that
>>>>>>>>> change. I spent a while staring at it last night but still can't
>>>>>>>>> imagine how it's a problem.
>>>>>>>>>
>>>>>>>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <da...@apache.org> wrote:
>>>>>>>>>> SVN commit r954043 looks suspicious. Digging further.
>>>>>>>>>>
>>>>>>>>>> -Damien
>>>>>>>>>
>>>>>>>>
>>>>>>>> I still want to stare at r954043, but it looks to me like there's at
>>>>>>>> least one situation where we do not commit data correctly during
>>>>>>>> compaction. This has to do with the way we now use the path to sync
>>>>>>>> outside the couch_file:process. Check this diff:
>>>>>>>> http://gist.github.com/513081
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>
>
>

Re: Data loss

Posted by Damien Katz <da...@apache.org>.
I reproduced this manually:

Create document with id "x", ensure full commit (simply wait longer than 1 sec, say 2 secs).

Attempt to create document "x" again, get conflict error.

Wait at least 2 secs to ensure the delayed commit attempt happens.

Now create document "y".

Wait at least 2 secs because the delayed commit should happen.

Restart server.

Document "y" is now missing.

The last delayed commit isn't happening. From then on out, no docs updated with delayed commit will be available after a restart.

-Damien
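
The repro above can be modeled as a small state machine. What follows is a hypothetical Python sketch, not the actual implementation (the real code is Erlang, in couch_db_updater, and every name here is illustrative): a write arms the delayed-commit timer only when `waiting_delayed_commit` is clear, and the buggy handler leaves that flag set after a fire with nothing to commit.

```python
# Hypothetical Python model of the delayed-commit bug reproduced above.
# The real implementation is Erlang (couch_db_updater); all names here are
# illustrative, not CouchDB APIs.

class Db:
    def __init__(self):
        self.timer_armed = False             # a delayed_commit timer is pending
        self.waiting_delayed_commit = False  # the #db record flag
        self.committed = 0                   # update seq in the on-disk header
        self.uncommitted = 0                 # in-memory update seq

    def write(self, ok):
        """One write attempt; ok=False models an all-conflicts batch."""
        if ok:
            self.uncommitted += 1
        if not self.waiting_delayed_commit:  # arm a timer only if none pending
            self.waiting_delayed_commit = True
            self.timer_armed = True

    def on_delayed_commit(self, fixed):
        """The delayed_commit timer fired; write a header if anything changed."""
        self.timer_armed = False
        if self.uncommitted != self.committed:
            self.committed = self.uncommitted
            self.waiting_delayed_commit = False
        elif fixed:
            # the fix: clear the flag even when there was nothing to commit,
            # so the next write() can arm a fresh timer
            self.waiting_delayed_commit = False
        # buggy path: flag stays True, so no write() ever arms a timer again

def survives_restart(fixed):
    db = Db()
    db.write(True)                   # create "x"
    db.on_delayed_commit(fixed)      # commit fires, "x" hits disk
    db.write(False)                  # conflicting attempt to create "x" again
    db.on_delayed_commit(fixed)      # commit fires with nothing to write
    db.write(True)                   # create "y"
    if db.timer_armed:               # let any armed timer fire before restart
        db.on_delayed_commit(fixed)
    return db.committed == db.uncommitted   # is "y" in the on-disk header?
```

With `fixed=False` (modeling the 1.0.0 behavior) `survives_restart` returns `False`: document "y" never reaches the on-disk header, matching the manual repro. With `fixed=True` it returns `True`.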

On Aug 7, 2010, at 4:58 PM, Adam Kocoloski wrote:

> I believe it's a single delayed conflict write attempt and no successes in that same interval.
> 
> On Aug 7, 2010, at 7:51 PM, Damien Katz wrote:
> 
>> Looks like all that's necessary is a single delayed conflict write attempt, and all subsequent delayed commits won't be committed; the header never gets written.
>> 
>> 1.0 loses data. This is ridiculously bad.
>> 
>> We need a test to reproduce this and fix.
>> 
>> -Damien
>> 
>> On Aug 7, 2010, at 4:35 PM, Adam Kocoloski wrote:
>> 
>>> Good sleuthing guys, and my apologies for letting this through.  Randall, your patch in COUCHDB-794 was actually fine, it was my reworking of it that caused this serious bug.
>>> 
>>> With respect to that gist 513282, I think it would be better to return Db#db{waiting_delayed_commit=nil} when the headers match instead of moving the cancel_timer() command as you did.  After all, we did perform the check here -- it was just that nothing needed to be committed.
>>> 
>>> Adam
>>> 
>>> On Aug 7, 2010, at 6:55 PM, Damien Katz wrote:
>>> 
>>>> Yes, I think it requires 2 conflicting writes in a row, because it needs to trigger the delayed_commit timer without actually having anything to commit, so the header never changes.
>>>> 
>>>> Try to reproduce this and add a test case.
>>>> 
>>>> -Damien
>>>> 
>>>> 
>>>> On Aug 7, 2010, at 3:47 PM, Randall Leeds wrote:
>>>> 
>>>>> I think you may be right, Damien.
>>>>> If ever a write happens that only contains conflicts while waiting for
>>>>> a delayed commit message we might still be cancelling the timer. Is
>>>>> this what you're thinking? This would be the fix:
>>>>> http://gist.github.com/513282
>>>>> 
>>>>> On Sat, Aug 7, 2010 at 15:42, Damien Katz <da...@apache.org> wrote:
>>>>>> I think the problem might be that 2 conflicting write attempts in a row can leave the #db.waiting_delayed_commit set but the timer has been cancelled. Once that happens, the header may never be written, as it always thinks a delayed commit will fire soon.
>>>>>> 
>>>>>> -Damien
>>>>>> 
>>>>>> 
>>>>>> On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote:
>>>>>> 
>>>>>>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds <ra...@gmail.com> wrote:
>>>>>>>> I agree completely! I immediately thought of this because I wrote that
>>>>>>>> change. I spent a while staring at it last night but still can't
>>>>>>>> imagine how it's a problem.
>>>>>>>> 
>>>>>>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <da...@apache.org> wrote:
>>>>>>>>> SVN commit r954043 looks suspicious. Digging further.
>>>>>>>>> 
>>>>>>>>> -Damien
>>>>>>>> 
>>>>>>> 
>>>>>>> I still want to stare at r954043, but it looks to me like there's at
>>>>>>> least one situation where we do not commit data correctly during
>>>>>>> compaction. This has to do with the way we now use the path to sync
>>>>>>> outside the couch_file:process. Check this diff:
>>>>>>> http://gist.github.com/513081
>>>>>> 
>>>>>> 
>>>> 
>>> 
>> 
> 


Re: Data loss

Posted by Damien Katz <da...@apache.org>.
I reproduced this manually:

Create document with id "x", ensure full commit (simply wait longer than 1 sec, say 2 secs).

Attempt to create document "x" again, get conflict error.

Wait at least 2 secs to ensure the delayed commit attempt happens.

Now create document "y".

Wait at least 2 secs because the delayed commit should happen.

Restart server.

Document "y" is now missing.

The last delayed commit isn't happening. From then on out, no docs updated with delayed commit will be available after a restart.

-Damien

On Aug 7, 2010, at 4:58 PM, Adam Kocoloski wrote:

> I believe it's a single delayed conflict write attempt and no successes in that same interval.
> 
> On Aug 7, 2010, at 7:51 PM, Damien Katz wrote:
> 
>> Looks like all that's necessary is a single delayed conflict write attempt, and all subsequent delayed commits won't be committed; the header never gets written.
>> 
>> 1.0 loses data. This is ridiculously bad.
>> 
>> We need a test to reproduce this and fix.
>> 
>> -Damien
>> 
>> On Aug 7, 2010, at 4:35 PM, Adam Kocoloski wrote:
>> 
>>> Good sleuthing guys, and my apologies for letting this through.  Randall, your patch in COUCHDB-794 was actually fine, it was my reworking of it that caused this serious bug.
>>> 
>>> With respect to that gist 513282, I think it would be better to return Db#db{waiting_delayed_commit=nil} when the headers match instead of moving the cancel_timer() command as you did.  After all, we did perform the check here -- it was just that nothing needed to be committed.
>>> 
>>> Adam
>>> 
>>> On Aug 7, 2010, at 6:55 PM, Damien Katz wrote:
>>> 
>>>> Yes, I think it requires 2 conflicting writes in a row, because it needs to trigger the delayed_commit timer without actually having anything to commit, so the header never changes.
>>>> 
>>>> Try to reproduce this and add a test case.
>>>> 
>>>> -Damien
>>>> 
>>>> 
>>>> On Aug 7, 2010, at 3:47 PM, Randall Leeds wrote:
>>>> 
>>>>> I think you may be right, Damien.
>>>>> If ever a write happens that only contains conflicts while waiting for
>>>>> a delayed commit message we might still be cancelling the timer. Is
>>>>> this what you're thinking? This would be the fix:
>>>>> http://gist.github.com/513282
>>>>> 
>>>>> On Sat, Aug 7, 2010 at 15:42, Damien Katz <da...@apache.org> wrote:
>>>>>> I think the problem might be that 2 conflicting write attempts in a row can leave the #db.waiting_delayed_commit set but the timer has been cancelled. Once that happens, the header may never be written, as it always thinks a delayed commit will fire soon.
>>>>>> 
>>>>>> -Damien
>>>>>> 
>>>>>> 
>>>>>> On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote:
>>>>>> 
>>>>>>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds <ra...@gmail.com> wrote:
>>>>>>>> I agree completely! I immediately thought of this because I wrote that
>>>>>>>> change. I spent a while staring at it last night but still can't
>>>>>>>> imagine how it's a problem.
>>>>>>>> 
>>>>>>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <da...@apache.org> wrote:
>>>>>>>>> SVN commit r954043 looks suspicious. Digging further.
>>>>>>>>> 
>>>>>>>>> -Damien
>>>>>>>> 
>>>>>>> 
>>>>>>> I still want to stare at r954043, but it looks to me like there's at
>>>>>>> least one situation where we do not commit data correctly during
>>>>>>> compaction. This has to do with the way we now use the path to sync
>>>>>>> outside the couch_file:process. Check this diff:
>>>>>>> http://gist.github.com/513081
>>>>>> 
>>>>>> 
>>>> 
>>> 
>> 
> 


Re: Data loss

Posted by Adam Kocoloski <ko...@apache.org>.
I believe it's a single delayed conflict write attempt and no successes in that same interval.

On Aug 7, 2010, at 7:51 PM, Damien Katz wrote:

> Looks like all that's necessary is a single delayed conflict write attempt, and all subsequent delayed commits won't be committed; the header never gets written.
> 
> 1.0 loses data. This is ridiculously bad.
> 
> We need a test to reproduce this and fix.
> 
> -Damien
> 
> On Aug 7, 2010, at 4:35 PM, Adam Kocoloski wrote:
> 
>> Good sleuthing guys, and my apologies for letting this through.  Randall, your patch in COUCHDB-794 was actually fine, it was my reworking of it that caused this serious bug.
>> 
>> With respect to that gist 513282, I think it would be better to return Db#db{waiting_delayed_commit=nil} when the headers match instead of moving the cancel_timer() command as you did.  After all, we did perform the check here -- it was just that nothing needed to be committed.
>> 
>> Adam
>> 
>> On Aug 7, 2010, at 6:55 PM, Damien Katz wrote:
>> 
>>> Yes, I think it requires 2 conflicting writes in a row, because it needs to trigger the delayed_commit timer without actually having anything to commit, so the header never changes.
>>> 
>>> Try to reproduce this and add a test case.
>>> 
>>> -Damien
>>> 
>>> 
>>> On Aug 7, 2010, at 3:47 PM, Randall Leeds wrote:
>>> 
>>>> I think you may be right, Damien.
>>>> If ever a write happens that only contains conflicts while waiting for
>>>> a delayed commit message we might still be cancelling the timer. Is
>>>> this what you're thinking? This would be the fix:
>>>> http://gist.github.com/513282
>>>> 
>>>> On Sat, Aug 7, 2010 at 15:42, Damien Katz <da...@apache.org> wrote:
>>>>> I think the problem might be that 2 conflicting write attempts in a row can leave the #db.waiting_delayed_commit set but the timer has been cancelled. Once that happens, the header may never be written, as it always thinks a delayed commit will fire soon.
>>>>> 
>>>>> -Damien
>>>>> 
>>>>> 
>>>>> On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote:
>>>>> 
>>>>>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds <ra...@gmail.com> wrote:
>>>>>>> I agree completely! I immediately thought of this because I wrote that
>>>>>>> change. I spent a while staring at it last night but still can't
>>>>>>> imagine how it's a problem.
>>>>>>> 
>>>>>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <da...@apache.org> wrote:
>>>>>>>> SVN commit r954043 looks suspicious. Digging further.
>>>>>>>> 
>>>>>>>> -Damien
>>>>>>> 
>>>>>> 
>>>>>> I still want to stare at r954043, but it looks to me like there's at
>>>>>> least one situation where we do not commit data correctly during
>>>>>> compaction. This has to do with the way we now use the path to sync
>>>>>> outside the couch_file:process. Check this diff:
>>>>>> http://gist.github.com/513081
>>>>> 
>>>>> 
>>> 
>> 
> 


Re: Data loss

Posted by Damien Katz <da...@apache.org>.
Looks like all that's necessary is a single delayed conflict write attempt, and all subsequent delayed commits won't be committed; the header never gets written.

1.0 loses data. This is ridiculously bad.

We need a test to reproduce this and fix.

-Damien

On Aug 7, 2010, at 4:35 PM, Adam Kocoloski wrote:

> Good sleuthing guys, and my apologies for letting this through.  Randall, your patch in COUCHDB-794 was actually fine, it was my reworking of it that caused this serious bug.
> 
> With respect to that gist 513282, I think it would be better to return Db#db{waiting_delayed_commit=nil} when the headers match instead of moving the cancel_timer() command as you did.  After all, we did perform the check here -- it was just that nothing needed to be committed.
> 
> Adam
> 
> On Aug 7, 2010, at 6:55 PM, Damien Katz wrote:
> 
>> Yes, I think it requires 2 conflicting writes in a row, because it needs to trigger the delayed_commit timer without actually having anything to commit, so the header never changes.
>> 
>> Try to reproduce this and add a test case.
>> 
>> -Damien
>> 
>> 
>> On Aug 7, 2010, at 3:47 PM, Randall Leeds wrote:
>> 
>>> I think you may be right, Damien.
>>> If ever a write happens that only contains conflicts while waiting for
>>> a delayed commit message we might still be cancelling the timer. Is
>>> this what you're thinking? This would be the fix:
>>> http://gist.github.com/513282
>>> 
>>> On Sat, Aug 7, 2010 at 15:42, Damien Katz <da...@apache.org> wrote:
>>>> I think the problem might be that 2 conflicting write attempts in a row can leave the #db.waiting_delayed_commit set but the timer has been cancelled. Once that happens, the header may never be written, as it always thinks a delayed commit will fire soon.
>>>> 
>>>> -Damien
>>>> 
>>>> 
>>>> On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote:
>>>> 
>>>>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds <ra...@gmail.com> wrote:
>>>>>> I agree completely! I immediately thought of this because I wrote that
>>>>>> change. I spent a while staring at it last night but still can't
>>>>>> imagine how it's a problem.
>>>>>> 
>>>>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <da...@apache.org> wrote:
>>>>>>> SVN commit r954043 looks suspicious. Digging further.
>>>>>>> 
>>>>>>> -Damien
>>>>>> 
>>>>> 
>>>>> I still want to stare at r954043, but it looks to me like there's at
>>>>> least one situation where we do not commit data correctly during
>>>>> compaction. This has to do with the way we now use the path to sync
>>>>> outside the couch_file:process. Check this diff:
>>>>> http://gist.github.com/513081
>>>> 
>>>> 
>> 
> 


Re: Data loss

Posted by Adam Kocoloski <ko...@apache.org>.
Good sleuthing guys, and my apologies for letting this through.  Randall, your patch in COUCHDB-794 was actually fine, it was my reworking of it that caused this serious bug.

With respect to that gist 513282, I think it would be better to return Db#db{waiting_delayed_commit=nil} when the headers match instead of moving the cancel_timer() command as you did.  After all, we did perform the check here -- it was just that nothing needed to be committed.

Adam

On Aug 7, 2010, at 6:55 PM, Damien Katz wrote:

> Yes, I think it requires 2 conflicting writes in a row, because it needs to trigger the delayed_commit timer without actually having anything to commit, so the header never changes.
> 
> Try to reproduce this and add a test case.
> 
> -Damien
> 
> 
> On Aug 7, 2010, at 3:47 PM, Randall Leeds wrote:
> 
>> I think you may be right, Damien.
>> If ever a write happens that only contains conflicts while waiting for
>> a delayed commit message we might still be cancelling the timer. Is
>> this what you're thinking? This would be the fix:
>> http://gist.github.com/513282
>> 
>> On Sat, Aug 7, 2010 at 15:42, Damien Katz <da...@apache.org> wrote:
>>> I think the problem might be that 2 conflicting write attempts in a row can leave the #db.waiting_delayed_commit set but the timer has been cancelled. Once that happens, the header may never be written, as it always thinks a delayed commit will fire soon.
>>> 
>>> -Damien
>>> 
>>> 
>>> On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote:
>>> 
>>>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds <ra...@gmail.com> wrote:
>>>>> I agree completely! I immediately thought of this because I wrote that
>>>>> change. I spent a while staring at it last night but still can't
>>>>> imagine how it's a problem.
>>>>> 
>>>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <da...@apache.org> wrote:
>>>>>> SVN commit r954043 looks suspicious. Digging further.
>>>>>> 
>>>>>> -Damien
>>>>> 
>>>> 
>>>> I still want to stare at r954043, but it looks to me like there's at
>>>> least one situation where we do not commit data correctly during
>>>> compaction. This has to do with the way we now use the path to sync
>>>> outside the couch_file:process. Check this diff:
>>>> http://gist.github.com/513081
>>> 
>>> 
> 


Re: Data loss

Posted by Damien Katz <da...@apache.org>.
Yes, I think it requires 2 conflicting writes in a row, because it needs to trigger the delayed_commit timer without actually having anything to commit, so the header never changes.

Try to reproduce this and add a test case.

-Damien


On Aug 7, 2010, at 3:47 PM, Randall Leeds wrote:

> I think you may be right, Damien.
> If ever a write happens that only contains conflicts while waiting for
> a delayed commit message we might still be cancelling the timer. Is
> this what you're thinking? This would be the fix:
> http://gist.github.com/513282
> 
> On Sat, Aug 7, 2010 at 15:42, Damien Katz <da...@apache.org> wrote:
>> I think the problem might be that 2 conflicting write attempts in a row can leave the #db.waiting_delayed_commit set but the timer has been cancelled. Once that happens, the header may never be written, as it always thinks a delayed commit will fire soon.
>> 
>> -Damien
>> 
>> 
>> On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote:
>> 
>>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds <ra...@gmail.com> wrote:
>>>> I agree completely! I immediately thought of this because I wrote that
>>>> change. I spent a while staring at it last night but still can't
>>>> imagine how it's a problem.
>>>> 
>>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <da...@apache.org> wrote:
>>>>> SVN commit r954043 looks suspicious. Digging further.
>>>>> 
>>>>> -Damien
>>>> 
>>> 
>>> I still want to stare at r954043, but it looks to me like there's at
>>> least one situation where we do not commit data correctly during
>>> compaction. This has to do with the way we now use the path to sync
>>> outside the couch_file:process. Check this diff:
>>> http://gist.github.com/513081
>> 
>> 


Re: Data loss

Posted by Randall Leeds <ra...@gmail.com>.
I think you may be right, Damien.
If ever a write happens that only contains conflicts while waiting for
a delayed commit message, we might still be cancelling the timer. Is
this what you're thinking? This would be the fix:
http://gist.github.com/513282

On Sat, Aug 7, 2010 at 15:42, Damien Katz <da...@apache.org> wrote:
> I think the problem might be that 2 conflicting write attempts in a row can leave the #db.waiting_delayed_commit set but the timer has been cancelled. Once that happens, the header may never be written, as it always thinks a delayed commit will fire soon.
>
> -Damien
>
>
> On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote:
>
>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds <ra...@gmail.com> wrote:
>>> I agree completely! I immediately thought of this because I wrote that
>>> change. I spent a while staring at it last night but still can't
>>> imagine how it's a problem.
>>>
>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <da...@apache.org> wrote:
>>>> SVN commit r954043 looks suspicious. Digging further.
>>>>
>>>> -Damien
>>>
>>
>> I still want to stare at r954043, but it looks to me like there's at
>> least one situation where we do not commit data correctly during
>> compaction. This has to do with the way we now use the path to sync
>> outside the couch_file:process. Check this diff:
>> http://gist.github.com/513081
>
>

Re: Data loss

Posted by Damien Katz <da...@apache.org>.
I think the problem might be that 2 conflicting write attempts in a row can leave the #db.waiting_delayed_commit set but the timer has been cancelled. Once that happens, the header may never be written, as it always thinks a delayed commit will fire soon.

-Damien


On Aug 7, 2010, at 12:08 PM, Randall Leeds wrote:

> On Sat, Aug 7, 2010 at 11:56, Randall Leeds <ra...@gmail.com> wrote:
>> I agree completely! I immediately thought of this because I wrote that
>> change. I spent a while staring at it last night but still can't
>> imagine how it's a problem.
>> 
>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <da...@apache.org> wrote:
>>> SVN commit r954043 looks suspicious. Digging further.
>>> 
>>> -Damien
>> 
> 
> I still want to stare at r954043, but it looks to me like there's at
> least one situation where we do not commit data correctly during
> compaction. This has to do with the way we now use the path to sync
> outside the couch_file:process. Check this diff:
> http://gist.github.com/513081


Re: Data loss

Posted by Benoit Chesneau <bc...@gmail.com>.
On Sat, Aug 7, 2010 at 11:00 PM, J Chris Anderson <jc...@apache.org> wrote:
>
> On Aug 7, 2010, at 1:38 PM, Benoit Chesneau wrote:
>
>> On Sat, Aug 7, 2010 at 10:35 PM, J Chris Anderson <jc...@apache.org> wrote:
>>>
>>> On Aug 7, 2010, at 1:24 PM, Noah Slater wrote:
>>>
>>>> This sounds moderately serious.
>>>>
>>>> Would you like me to abort the current vote in lieu of a fix?
>>>
>>> I was planning to -1 the vote once we figure out what the deal is. There's a chance the bug only strikes in certain time zones (weird huh?) which might mean it's OK to continue the release.
>>>
>>> Chris
>>>
>> Just to add me to the list,  I noticed  this issue too with geocouch
>> on 1.0. That was on openbsd and mac.
>>
>
> with GeoCouch, you mean in the indexers or the main database file?
>
> Chris
>
>> - benoit
>
>
The main database file like others. After a crash I get old version of
documents.

- benoit

Re: Data loss

Posted by J Chris Anderson <jc...@apache.org>.
On Aug 7, 2010, at 1:38 PM, Benoit Chesneau wrote:

> On Sat, Aug 7, 2010 at 10:35 PM, J Chris Anderson <jc...@apache.org> wrote:
>> 
>> On Aug 7, 2010, at 1:24 PM, Noah Slater wrote:
>> 
>>> This sounds moderately serious.
>>> 
>>> Would you like me to abort the current vote in lieu of a fix?
>> 
>> I was planning to -1 the vote once we figure out what the deal is. There's a chance the bug only strikes in certain time zones (weird huh?) which might mean it's OK to continue the release.
>> 
>> Chris
>> 
> Just to add me to the list,  I noticed  this issue too with geocouch
> on 1.0. That was on openbsd and mac.
> 

with GeoCouch, you mean in the indexers or the main database file?

Chris

> - benoit


Re: Data loss

Posted by Benoit Chesneau <bc...@gmail.com>.
On Sat, Aug 7, 2010 at 10:35 PM, J Chris Anderson <jc...@apache.org> wrote:
>
> On Aug 7, 2010, at 1:24 PM, Noah Slater wrote:
>
>> This sounds moderately serious.
>>
>> Would you like me to abort the current vote in lieu of a fix?
>
> I was planning to -1 the vote once we figure out what the deal is. There's a chance the bug only strikes in certain time zones (weird huh?) which might mean it's OK to continue the release.
>
> Chris
>
Just to add me to the list,  I noticed  this issue too with geocouch
on 1.0. That was on openbsd and mac.

- benoit

Re: Data loss

Posted by J Chris Anderson <jc...@apache.org>.
On Aug 7, 2010, at 1:24 PM, Noah Slater wrote:

> This sounds moderately serious.
> 
> Would you like me to abort the current vote in lieu of a fix?

I was planning to -1 the vote once we figure out what the deal is. There's a chance the bug only strikes in certain time zones (weird huh?) which might mean it's OK to continue the release.

Chris

> 
> On 7 Aug 2010, at 20:08, Randall Leeds wrote:
> 
>> On Sat, Aug 7, 2010 at 11:56, Randall Leeds <ra...@gmail.com> wrote:
>>> I agree completely! I immediately thought of this because I wrote that
>>> change. I spent a while staring at it last night but still can't
>>> imagine how it's a problem.
>>> 
>>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <da...@apache.org> wrote:
>>>> SVN commit r954043 looks suspicious. Digging further.
>>>> 
>>>> -Damien
>>> 
>> 
>> I still want to stare at r954043, but it looks to me like there's at
>> least one situation where we do not commit data correctly during
>> compaction. This has to do with the way we now use the path to sync
>> outside the couch_file:process. Check this diff:
>> http://gist.github.com/513081
> 


Re: Data loss

Posted by Noah Slater <ns...@apache.org>.
This sounds moderately serious.

Would you like me to abort the current vote in lieu of a fix?

On 7 Aug 2010, at 20:08, Randall Leeds wrote:

> On Sat, Aug 7, 2010 at 11:56, Randall Leeds <ra...@gmail.com> wrote:
>> I agree completely! I immediately thought of this because I wrote that
>> change. I spent a while staring at it last night but still can't
>> imagine how it's a problem.
>> 
>> On Sat, Aug 7, 2010 at 11:12, Damien Katz <da...@apache.org> wrote:
>>> SVN commit r954043 looks suspicious. Digging further.
>>> 
>>> -Damien
>> 
> 
> I still want to stare at r954043, but it looks to me like there's at
> least one situation where we do not commit data correctly during
> compaction. This has to do with the way we now use the path to sync
> outside the couch_file:process. Check this diff:
> http://gist.github.com/513081


Re: Data loss

Posted by Randall Leeds <ra...@gmail.com>.
On Sat, Aug 7, 2010 at 11:56, Randall Leeds <ra...@gmail.com> wrote:
> I agree completely! I immediately thought of this because I wrote that
> change. I spent a while staring at it last night but still can't
> imagine how it's a problem.
>
> On Sat, Aug 7, 2010 at 11:12, Damien Katz <da...@apache.org> wrote:
>> SVN commit r954043 looks suspicious. Digging further.
>>
>> -Damien
>

I still want to stare at r954043, but it looks to me like there's at
least one situation where we do not commit data correctly during
compaction. This has to do with the way we now use the path to sync
outside the couch_file:process. Check this diff:
http://gist.github.com/513081
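
The hazard with syncing by path can be sketched under a simplifying assumption (a buffered Python file object standing in for the writer process): fsync on a second descriptor only makes durable what the kernel has already received, so data still sitting in a writer's userspace buffer is untouched.

```python
import os
import tempfile

def sync_by_path_then_read(flush_first):
    """Write via a buffered object, fsync a *separate* fd opened by path,
    then report what a reader sees afterwards."""
    fd0, path = tempfile.mkstemp()
    os.close(fd0)
    writer = open(path, "w")      # buffered, like a writer process
    writer.write("header v2")
    if flush_first:
        writer.flush()            # push the userspace buffer to the kernel
    fd = os.open(path, os.O_RDONLY)
    os.fsync(fd)                  # durability for kernel-held data only
    os.close(fd)
    with open(path) as f:
        seen = f.read()
    writer.close()
    os.unlink(path)
    return seen

print(repr(sync_by_path_then_read(flush_first=False)))  # '' -- still buffered
print(repr(sync_by_path_then_read(flush_first=True)))   # 'header v2'
```

In couch_file terms the analogous requirement would be that any writes buffered inside the couch_file process reach the OS before an out-of-process sync on the path can mean anything.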

Re: Data loss

Posted by Randall Leeds <ra...@gmail.com>.
I agree completely! I immediately thought of this because I wrote that
change. I spent a while staring at it last night but still can't
imagine how it's a problem.

On Sat, Aug 7, 2010 at 11:12, Damien Katz <da...@apache.org> wrote:
> SVN commit r954043 looks suspicious. Digging further.
>
> -Damien

Re: Data loss

Posted by Sascha Reuter <s....@geek-it.de>.
Maybe it's a good idea to send out a warning or something. My guess is that other people affected by this problem just don't know until they restart their instances... :-/ don't know... 

Am 07.08.2010 um 20:15 schrieb Volker Mische <vo...@gmail.com>:

> Damien,
> 
> as you already found a suspicious commit, this might not be much help. But the GeoCouch merge with 1.0 also had some issues; the original GeoCouch, which is based on a checkout from around the beginning of March, did not.
> 
> Cheers,
>  Volker
> 
> On 08/07/2010 08:12 PM, Damien Katz wrote:
>> SVN commit r954043 looks suspicious. Digging further.
>> 
>> -Damien
>> 
>> On Aug 7, 2010, at 10:31 AM, J Chris Anderson wrote:
>> 
>>> 
>>> On Aug 7, 2010, at 1:21 AM, Sascha Reuter wrote:
>>> 
>>>> That's exactly what I reported 2 days ago! A bug is already opened and the database file was provided to the Couchio guys! Running on Linux...
>>>> 
>>> 
>>> Thanks, we're keenly interested in seeing what's going on here.
>>> 
>>> Chris
>>> 
>>>> Am 07.08.2010 um 06:37 schrieb J Chris Anderson<jc...@apache.org>:
>>>> 
>>>>> 
>>>>> On Aug 6, 2010, at 9:02 PM, Yue Chuan Lim wrote:
>>>>> 
>>>>>> Sorry to reply myself so quickly.
>>>>>> 
>>>>>> Peeking inside the .couch file and searching for the documents I am
>>>>>> missing turns up results. Offhand I am noticing 4 instances of the string
>>>>>> gsc_test_03, which is the ID of the document I am missing.
>>>>>> 
>>>>> 
>>>>> You are on Windows? Perhaps this is an issue with the windows file handling.
>>>>> 
>>>>> Can you comment on this bug, and also save those .couch files in case we need to analyze them?
>>>>> 
>>>>> Corruption like this should be impossible, but this is the second case we've heard lately, so I'm guessing it is a Windows issue.
>>>>> 
>>>>> Please comment on this bug with information about your machine environment (OS, Filesystem, Disk size, etc)
>>>>> 
>>>>> https://issues.apache.org/jira/browse/COUCHDB-844
>>>>> 
>>>>> Thanks!
>>>>> 
>>>>> Chris
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Sat, Aug 7, 2010 at 11:58 AM, Yue Chuan Lim<sh...@gmail.com>  wrote:
>>>>>> 
>>>>>>> I have a set of documents that have been committed for more than a day,
>>>>>>> regularly read from without a problem. Had to stop the database service to
>>>>>>> do some debugging, used the couchdb.bat provided in CouchDB/bin for easy
>>>>>>> access to the log. And I noticed that I basically lost all the documents in
>>>>>>> question.
>>>>>>> 
>>>>>>> There does not appear to be corruption per se, but it is as if my database
>>>>>>> just rolled back to the state it was in a few days ago, i.e. most of my
>>>>>>> documents are there but some old documents that I'm pretty sure I have
>>>>>>> deleted are back, and my newer documents are gone.
>>>>>>> 
>>>>>>> Appears to have happened to me more than once, shrugged it off the last
>>>>>>> time as it might be just a mix up, but I am definite that my database has
>>>>>>> certainly rolled back this time.
>>>>>>> 
>>>>>>> Is there any situation in which this might happen?
>>>>>>> 
>>>>>>> Thanks
>>>>>>> Yue Chuan
>>>>>>> 
>>>>> 
>>> 
>> 
> 

Re: Data loss

Posted by Volker Mische <vo...@gmail.com>.
Damien,

as you already found a suspicious commit, this might not be much help. But 
the GeoCouch merge with 1.0 also had some issues; the original GeoCouch, 
which is based on a checkout from around the beginning of March, did not.

Cheers,
   Volker

On 08/07/2010 08:12 PM, Damien Katz wrote:
> SVN commit r954043 looks suspicious. Digging further.
>
> -Damien
>
> On Aug 7, 2010, at 10:31 AM, J Chris Anderson wrote:
>
>>
>> On Aug 7, 2010, at 1:21 AM, Sascha Reuter wrote:
>>
>>> That's exactly what I reported 2 days ago! A bug is already opened and the database file was provided to the Couchio guys! Running on Linux...
>>>
>>
>> Thanks, we're keenly interested in seeing what's going on here.
>>
>> Chris
>>
>>> Am 07.08.2010 um 06:37 schrieb J Chris Anderson<jc...@apache.org>:
>>>
>>>>
>>>> On Aug 6, 2010, at 9:02 PM, Yue Chuan Lim wrote:
>>>>
>>>>> Sorry to reply myself so quickly.
>>>>>
>>>>> Peeking inside the .couch file and searching for the documents I am
>>>>> missing turns up results. Offhand I am noticing 4 instances of the string
>>>>> gsc_test_03, which is the ID of the document I am missing.
>>>>>
>>>>
>>>> You are on Windows? Perhaps this is an issue with the windows file handling.
>>>>
>>>> Can you comment on this bug, and also save those .couch files in case we need to analyze them?
>>>>
>>>> Corruption like this should be impossible, but this is the second case we've heard lately, so I'm guessing it is a Windows issue.
>>>>
>>>> Please comment on this bug with information about your machine environment (OS, Filesystem, Disk size, etc)
>>>>
>>>> https://issues.apache.org/jira/browse/COUCHDB-844
>>>>
>>>> Thanks!
>>>>
>>>> Chris
>>>>
>>>>
>>>>
>>>>> On Sat, Aug 7, 2010 at 11:58 AM, Yue Chuan Lim<sh...@gmail.com>  wrote:
>>>>>
>>>>>> I have a set of documents that have been committed for more than a day,
>>>>>> regularly read from without a problem. Had to stop the database service to
>>>>>> do some debugging, used the couchdb.bat provided in CouchDB/bin for easy
>>>>>> access to the log. And I noticed that I basically lost all the documents in
>>>>>> question.
>>>>>>
>>>>>> There does not appear to be corruption per se, but it is as if my database
>>>>>> just rolled back to the state it was in a few days ago, i.e. most of my
>>>>>> documents are there but some old documents that I'm pretty sure I have
>>>>>> deleted are back, and my newer documents are gone.
>>>>>>
>>>>>> Appears to have happened to me more than once, shrugged it off the last
>>>>>> time as it might be just a mix up, but I am definite that my database has
>>>>>> certainly rolled back this time.
>>>>>>
>>>>>> Is there any situation in which this might happen?
>>>>>>
>>>>>> Thanks
>>>>>> Yue Chuan
>>>>>>
>>>>
>>
>


Re: Data loss

Posted by Damien Katz <da...@apache.org>.
SVN commit r954043 looks suspicious. Digging further.

-Damien

On Aug 7, 2010, at 10:31 AM, J Chris Anderson wrote:

> 
> On Aug 7, 2010, at 1:21 AM, Sascha Reuter wrote:
> 
>> That's exactly what I reported 2 days ago! A bug is already opened and the database file was provided to the Couchio guys! Running on Linux...
>> 
> 
> Thanks, we're keenly interested in seeing what's going on here.
> 
> Chris
> 
>> Am 07.08.2010 um 06:37 schrieb J Chris Anderson <jc...@apache.org>:
>> 
>>> 
>>> On Aug 6, 2010, at 9:02 PM, Yue Chuan Lim wrote:
>>> 
>>>> Sorry to reply myself so quickly.
>>>> 
>>>> Peeking inside the .couch file and searching for the documents I am
>>>> missing turns up results. Offhand I am noticing 4 instances of the string
>>>> gsc_test_03, which is the ID of the document I am missing.
>>>> 
>>> 
>>> You are on Windows? Perhaps this is an issue with the windows file handling.
>>> 
>>> Can you comment on this bug, and also save those .couch files in case we need to analyze them?
>>> 
>>> Corruption like this should be impossible, but this is the second case we've heard lately, so I'm guessing it is a Windows issue.
>>> 
>>> Please comment on this bug with information about your machine environment (OS, Filesystem, Disk size, etc)
>>> 
>>> https://issues.apache.org/jira/browse/COUCHDB-844
>>> 
>>> Thanks!
>>> 
>>> Chris
>>> 
>>> 
>>> 
>>>> On Sat, Aug 7, 2010 at 11:58 AM, Yue Chuan Lim <sh...@gmail.com> wrote:
>>>> 
>>>>> I have a set of documents that have been committed for more than a day,
>>>>> regularly read from without a problem. Had to stop the database service to
>>>>> do some debugging, used the couchdb.bat provided in CouchDB/bin for easy
>>>>> access to the log. And I noticed that I basically lost all the documents in
>>>>> question.
>>>>> 
>>>>> There does not appear to be corruption per se, but it is as if my database
>>>>> just rolled back to the state it was in a few days ago, i.e. most of my
>>>>> documents are there but some old documents that I'm pretty sure I have
>>>>> deleted are back, and my newer documents are gone.
>>>>> 
>>>>> Appears to have happened to me more than once, shrugged it off the last
>>>>> time as it might be just a mix up, but I am definite that my database has
>>>>> certainly rolled back this time.
>>>>> 
>>>>> Is there any situation in which this might happen?
>>>>> 
>>>>> Thanks
>>>>> Yue Chuan
>>>>> 
>>> 
> 


Re: Data loss

Posted by J Chris Anderson <jc...@apache.org>.
On Aug 7, 2010, at 1:21 AM, Sascha Reuter wrote:

> That's exactly what I reported 2 days ago! A bug is already opened and the database file was provided to the Couchio guys! Running on Linux...
> 

Thanks, we're keenly interested in seeing what's going on here.

Chris

> Am 07.08.2010 um 06:37 schrieb J Chris Anderson <jc...@apache.org>:
> 
>> 
>> On Aug 6, 2010, at 9:02 PM, Yue Chuan Lim wrote:
>> 
>>> Sorry to reply myself so quickly.
>>> 
>>> Peeking inside the .couch file and searching for the documents I am
>>> missing turns up results. Offhand I am noticing 4 instances of the string
>>> gsc_test_03, which is the ID of the document I am missing.
>>> 
>> 
>> You are on Windows? Perhaps this is an issue with the windows file handling.
>> 
>> Can you comment on this bug, and also save those .couch files in case we need to analyze them?
>> 
>> Corruption like this should be impossible, but this is the second case we've heard lately, so I'm guessing it is a Windows issue.
>> 
>> Please comment on this bug with information about your machine environment (OS, Filesystem, Disk size, etc)
>> 
>> https://issues.apache.org/jira/browse/COUCHDB-844
>> 
>> Thanks!
>> 
>> Chris
>> 
>> 
>> 
>>> On Sat, Aug 7, 2010 at 11:58 AM, Yue Chuan Lim <sh...@gmail.com> wrote:
>>> 
>>>> I have a set of documents that have been committed for more than a day,
>>>> regularly read from without a problem. Had to stop the database service to
>>>> do some debugging, used the couchdb.bat provided in CouchDB/bin for easy
>>>> access to the log. And I noticed that I basically lost all the documents in
>>>> question.
>>>> 
>>>> There does not appear to be corruption per se, but it is as if my database
>>>> just rolled back to the state it was in a few days ago, i.e. most of my
>>>> documents are there but some old documents that I'm pretty sure I have
>>>> deleted are back, and my newer documents are gone.
>>>> 
>>>> Appears to have happened to me more than once, shrugged it off the last
>>>> time as it might be just a mix up, but I am definite that my database has
>>>> certainly rolled back this time.
>>>> 
>>>> Is there any situation in which this might happen?
>>>> 
>>>> Thanks
>>>> Yue Chuan
>>>> 
>> 


Re: Data loss

Posted by Sascha Reuter <s....@geek-it.de>.
That's exactly what I reported 2 days ago! A bug is already opened and the database file was provided to the Couchio guys! Running on Linux...

Am 07.08.2010 um 06:37 schrieb J Chris Anderson <jc...@apache.org>:

> 
> On Aug 6, 2010, at 9:02 PM, Yue Chuan Lim wrote:
> 
>> Sorry to reply myself so quickly.
>> 
>> Peeking inside the .couch file and searching for the documents I am
>> missing turns up results. Offhand I am noticing 4 instances of the string
>> gsc_test_03, which is the ID of the document I am missing.
>> 
> 
> You are on Windows? Perhaps this is an issue with the windows file handling.
> 
> Can you comment on this bug, and also save those .couch files in case we need to analyze them?
> 
> Corruption like this should be impossible, but this is the second case we've heard lately, so I'm guessing it is a Windows issue.
> 
> Please comment on this bug with information about your machine environment (OS, Filesystem, Disk size, etc)
> 
> https://issues.apache.org/jira/browse/COUCHDB-844
> 
> Thanks!
> 
> Chris
> 
> 
> 
>> On Sat, Aug 7, 2010 at 11:58 AM, Yue Chuan Lim <sh...@gmail.com> wrote:
>> 
>>> I have a set of documents that have been committed for more than a day,
>>> regularly read from without a problem. Had to stop the database service to
>>> do some debugging, used the couchdb.bat provided in CouchDB/bin for easy
>>> access to the log. And I noticed that I basically lost all the documents in
>>> question.
>>> 
>>> There does not appear to be corruption per se, but it is as if my database
>>> just rolled back to the state it was in a few days ago, i.e. most of my
>>> documents are there but some old documents that I'm pretty sure I have
>>> deleted are back, and my newer documents are gone.
>>> 
>>> Appears to have happened to me more than once, shrugged it off the last
>>> time as it might be just a mix up, but I am definite that my database has
>>> certainly rolled back this time.
>>> 
>>> Is there any situation in which this might happen?
>>> 
>>> Thanks
>>> Yue Chuan
>>> 
> 

Re: Data loss

Posted by J Chris Anderson <jc...@apache.org>.
On Aug 6, 2010, at 9:02 PM, Yue Chuan Lim wrote:

> Sorry to reply myself so quickly.
> 
> Peeking inside the .couch file and searching for the documents I am
> missing turns up results. Offhand I am noticing 4 instances of the string
> gsc_test_03, which is the ID of the document I am missing.
> 

You are on Windows? Perhaps this is an issue with the Windows file handling.

Can you comment on this bug, and also save those .couch files in case we need to analyze them?

Corruption like this should be impossible, but this is the second case we've heard lately, so I'm guessing it is a Windows issue.

Please comment on this bug with information about your machine environment (OS, filesystem, disk size, etc.)

https://issues.apache.org/jira/browse/COUCHDB-844

Thanks!

Chris



> On Sat, Aug 7, 2010 at 11:58 AM, Yue Chuan Lim <sh...@gmail.com> wrote:
> 
>> I have a set of documents that have been committed for more than a day,
>> regularly read from without a problem. Had to stop the database service to
>> do some debugging, used the couchdb.bat provided in CouchDB/bin for easy
>> access to the log. And I noticed that I basically lost all the documents in
>> question.
>> 
>> There does not appear to be corruption per se, but it is as if my database
>> just rolled back to the state it was in a few days ago, i.e. most of my
>> documents are there but some old documents that I'm pretty sure I have
>> deleted are back, and my newer documents are gone.
>> 
>> Appears to have happened to me more than once, shrugged it off the last
>> time as it might be just a mix up, but I am definite that my database has
>> certainly rolled back this time.
>> 
>> Is there any situation in which this might happen?
>> 
>> Thanks
>> Yue Chuan
>> 


Re: Data loss

Posted by Yue Chuan Lim <sh...@gmail.com>.
Sorry to reply to myself so quickly.

Peeking inside the .couch file and searching for the documents I am
missing turns up results. Offhand I am noticing 4 instances of the string
gsc_test_03, which is the ID of the document I am missing.
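
Because the .couch file is append-only, earlier revisions of a document remain in the file even after deletion or update, so a raw byte search can find an ID several times. A small sketch of that kind of check; the helper name and the chunked scan are mine, not part of CouchDB:

```python
def count_occurrences(path, doc_id, chunk_size=1 << 20):
    """Count raw occurrences of doc_id's bytes in a file, reading in chunks
    so a large .couch file need not fit in memory."""
    needle = doc_id.encode()
    count = 0
    tail = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            data = tail + chunk
            count += data.count(needle)
            # Carry len(needle)-1 bytes forward so matches spanning a chunk
            # boundary are not missed (and cannot be double-counted, since
            # a full match never fits in the carried-over tail alone).
            tail = data[-(len(needle) - 1):] if len(needle) > 1 else b""
    return count
```

A count greater than one says nothing by itself about corruption; it only reflects the append-only storage keeping old revisions until compaction.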

On Sat, Aug 7, 2010 at 11:58 AM, Yue Chuan Lim <sh...@gmail.com> wrote:

> I have a set of documents that have been committed for more than a day,
> regularly read from without a problem. Had to stop the database service to
> do some debugging, used the couchdb.bat provided in CouchDB/bin for easy
> access to the log. And I noticed that I basically lost all the documents in
> question.
>
> There does not appear to be corruption per se, but it is as if my database
> just rolled back to the state it was in a few days ago, i.e. most of my
> documents are there but some old documents that I'm pretty sure I have
> deleted are back, and my newer documents are gone.
>
> Appears to have happened to me more than once, shrugged it off the last
> time as it might be just a mix up, but I am definite that my database has
> certainly rolled back this time.
>
> Is there any situation in which this might happen?
>
> Thanks
> Yue Chuan
>