Posted to dev@couchdb.apache.org by Victor Nicollet <vn...@runorg.com> on 2013/04/18 14:17:38 UTC

Corrupted database example file

Hello,

The @CouchDB twitter account thought you might find this information
helpful.

My SaaS start-up uses CouchDB as its primary database. Lately, I have been
having database corruption issues with version 1.2.0 : every few weeks, one
of our databases becomes corrupted, which has several negative consequences
(among others) :

   - Replication of that database fails (it does not even start).
   - Compaction of that database fails and *freezes* the server.
   - Several documents in the database become inaccessible through either
   direct access or through _all_docs.

 The latest affected database does not contain any information about our
customers, so I am allowed to release it publicly :

http://nicollet.net/public/2013-04-18.couchdb/prod-folder.couch

This database contains 325 irretrievable documents between identifiers
2xFEY0pU2Eb and 3Fn6l04G6Oa.

I hope this helps,

-- 
Victor Nicollet, CTO, www.runorg.com
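
For readers who want to probe the affected range of document identifiers, here is a minimal sketch that builds an _all_docs request limited to the two identifiers quoted in the message above. The base URL (http://localhost:5984) and the database name "prod-folder" (inferred from the file name) are assumptions, and the helper name is hypothetical; startkey and endkey are the standard CouchDB view query parameters and must be JSON-encoded.

```python
import json
from urllib.parse import urlencode


def all_docs_range_url(base_url, db, start_id, end_id):
    """Build an _all_docs URL restricted to a range of document IDs."""
    # CouchDB expects startkey/endkey as JSON values, so the IDs are
    # JSON-encoded (wrapped in double quotes) before URL-encoding.
    query = urlencode({
        "startkey": json.dumps(start_id),
        "endkey": json.dumps(end_id),
    })
    return f"{base_url.rstrip('/')}/{db}/_all_docs?{query}"


# Assumed server address and database name; the IDs come from the message.
url = all_docs_range_url(
    "http://localhost:5984", "prod-folder", "2xFEY0pU2Eb", "3Fn6l04G6Oa"
)
print(url)
```

Fetching that URL against a healthy copy of the database should list the documents in the range; against the corrupted file, the report above suggests the request may fail or omit rows.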

Re: Corrupted database example file

Posted by Benoit Chesneau <bc...@gmail.com>.

Can you provide any dmesg logs and anything else you have for checking your
disks? That could help us determine whether it's a CouchDB bug or the
result of faulty hardware.

- benoit

On Thu, Apr 18, 2013 at 2:17 PM, Victor Nicollet <vn...@runorg.com> wrote:
> Hello,
>
> The @CouchDB twitter account thought you might find this information
> helpful.
>
> My SaaS start-up uses CouchDB as its primary database. Lately, I have been
> having database corruption issues with version 1.2.0 : every few weeks, one
> of our databases becomes corrupted, which has several negative consequences
> (among others) :
>
>    - Replication of that database fails (it does not even start).
>    - Compaction of that database fails and *freezes* the server.
>    - Several documents in the database become inaccessible through either
>    direct access or through _all_docs.
>
>  The latest affected database does not contain any information about our
> customers, so I am allowed to release it publicly :
>
> http://nicollet.net/public/2013-04-18.couchdb/prod-folder.couch
>
> This database contains 325 irretrievable documents between identifiers
> 2xFEY0pU2Eb and 3Fn6l04G6Oa.
> I hope this helps,
>
> --
> Victor Nicollet, CTO, www.runorg.com

Re: Corrupted database example file

Posted by Paul Davis <pa...@gmail.com>.

Victor,

I finally remembered to ask a few of the ops guys I work with while
they were online about things to run to check for faulty hardware. The
general suggestions for detecting disk errors are first to check dmesg
and /var/log/messages for anything that looks amiss, and then to run fsck
to check the filesystem integrity and smartctl, which will let
you know if the disk thinks it's broken.

You may also want to run a RAM test on the machine. I'm told that most
BIOSes should have a utility for doing that these days. Otherwise
there's memtest86+, which is a downloadable ISO. They say, if you can, to
just let that run overnight, and if the machine is frozen in the
morning you've found the issue.

HTH,
Paul Davis
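
As a rough illustration of the "check dmesg and /var/log/messages for anything that looks amiss" step, here is a minimal sketch that filters log text for lines resembling disk errors. The pattern list and the sample log lines are illustrative assumptions, not an authoritative catalogue: real kernel messages vary by driver and kernel version, so an empty result does not prove the disk is healthy.

```python
import re

# Hypothetical patterns for common-looking disk trouble; adjust for your system.
DISK_ERROR_PATTERNS = [
    r"I/O error",
    r"EXT4-fs error",
    r"failed command",
    r"Medium Error",
]


def find_suspicious_lines(log_text):
    """Return the log lines that match any of the disk-error patterns."""
    pattern = re.compile("|".join(DISK_ERROR_PATTERNS), re.IGNORECASE)
    return [line for line in log_text.splitlines() if pattern.search(line)]


# Made-up sample standing in for dmesg output.
sample = """\
[12345.678] EXT4-fs (sda2): mounted filesystem with ordered data mode
[12400.001] Buffer I/O error on device sda2, logical block 1234
[12400.002] ata1.00: failed command: READ FPDMA QUEUED
[12500.000] eth0: link up
"""

hits = find_suspicious_lines(sample)
for line in hits:
    print(line)
```

In practice the text would come from the output of dmesg or from the contents of /var/log/messages rather than from a hard-coded sample.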

On Thu, Apr 18, 2013 at 5:07 PM, Victor Nicollet <vn...@runorg.com> wrote:
> Replying to my own mail, hoping it will end up in the same thread (I was
> not fully subscribed when I posted this, but I still read the archives).
>
> Answers to the questions you asked :
>
>  - I have no idea when the issue happened. I will try to track it down in
> the logs. I'm afraid I don't have time to filter out all customer
> information from the logs and provide them to you, though I can certainly
> grep for error dumps if you want me to. I have never seen disk-related
> errors in the log.
>  - I am running Debian x86_64 GNU/Linux, with erlang 1:15.b.1-d
>  - There are no unusual CouchDB configuration options ; the only change I
> performed was to disable reduce_limit. A perhaps notable usage aspect : all
> the databases are compacted hourly.
>  - It's not NFS. From /etc/fstab :
>
> /dev/sda1       /       ext4    errors=remount-ro       0       1
> /dev/sda2       /home   ext4    defaults                0       2
>
> The dual-partition setup is a silly default from OVH (my dedicated server
> host), so I have /var/lib/couchdb as a symlink to /home/couchdb/lib, from
> sda1 to sda2.
>
> - I can't rule out a disk issue, because I don't have a lot of experience
> with those... any obvious diagnosis command you would like me to run ? I am
> certain that I have not run out of disk space, though (still around 1TB
> free on that drive).
>
> Thank you for your patience.
>

Re: Corrupted database example file

Posted by Wendall Cada <we...@apache.org>.

Thanks for the feedback on this, Paul. This is outside of my area of
expertise, so it's nice to know that this isn't symptomatic of using
delayed_commits = true.

I also agree that this appears to be a hardware issue, and the only way
to confirm would be to mirror the setup on some separate hardware and
see if the issue persists.

Wendall

On 04/19/2013 12:40 PM, Paul Davis wrote:
> Doubtful that delayed commits would cause this. This isn't a matter of
> reordered writes or some writes not making it to disk. The binary
> would've been pushed towards disk in a single write request and the
> corruption appears to be in the middle of valid data which is a bit
> weird.
>
> My guess is this was either corrupted in RAM somehow before it hit
> disk or somehow the disk is returning bad reads. I've seen similar
> things before that end up preceding disk death but I'm also running a
> comparatively older code base (most importantly, no snappy).
>

Re: Corrupted database example file

Posted by Paul Davis <pa...@gmail.com>.

Doubtful that delayed commits would cause this. This isn't a matter of
reordered writes or some writes not making it to disk. The binary
would've been pushed towards disk in a single write request and the
corruption appears to be in the middle of valid data which is a bit
weird.

My guess is this was either corrupted in RAM somehow before it hit
disk or somehow the disk is returning bad reads. I've seen similar
things before that end up preceding disk death but I'm also running a
comparatively older code base (most importantly, no snappy).

On Fri, Apr 19, 2013 at 11:42 AM, Wendall Cada <we...@apache.org> wrote:
> If using the defaults isn't this set to delayed_commits = true still? Can't
> this lead to just this type of data corruption? I'd like to see
> delayed_commits = false and see if this is still happening.
>
> I'd also be keen on seeing this data replicated to a different piece of
> hardware with the same compaction schedule and see if the issue persists.
> I'm inclined to point the finger at a hard disk issue, but would like to see
> some confirmation that this can be reproduced with the same exact code on
> different hardware.
>
> I've run this same version heavily in production on several different
> systems doing essentially the same thing and have never seen a data
> corruption. The main difference is I always use delayed_commits = false
>
> Wendall
>
>

Re: Corrupted database example file

Posted by Wendall Cada <we...@apache.org>.

If using the defaults, isn't this still set to delayed_commits = true?
Can't this lead to just this type of data corruption? I'd like to see
delayed_commits = false tried, to check whether this is still happening.

I'd also be keen on seeing this data replicated to a different piece of
hardware with the same compaction schedule, to see if the issue
persists. I'm inclined to point the finger at a hard disk issue, but
would like to see some confirmation of whether this can be reproduced
with the exact same code on different hardware.

I've run this same version heavily in production on several different
systems doing essentially the same thing and have never seen any data
corruption. The main difference is that I always use delayed_commits = false.

Wendall
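
To check which delayed_commits value a given configuration file sets, a minimal sketch along these lines may help. It assumes the option lives in the [couchdb] section of an ini-style file such as local.ini, and that the shipped default is true, as the message above suggests; the helper name and the sample text are hypothetical.

```python
import configparser


def delayed_commits_enabled(ini_text, default=True):
    """Report whether the given ini text leaves delayed commits enabled."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    # Fall back to the assumed shipped default when the section or the
    # option is not present in this file.
    return parser.getboolean("couchdb", "delayed_commits", fallback=default)


# Hypothetical local.ini override that turns delayed commits off.
sample_local_ini = """
[couchdb]
delayed_commits = false
"""

print(delayed_commits_enabled(sample_local_ini))
```

A real check would read the text from the server's own configuration files, whose locations depend on how CouchDB was installed.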

On 04/19/2013 01:31 AM, Dave Cottlehuber wrote:
> Hi Victor,
>
> thanks for that information.
>
> Can we get a working copy of the database, so we can compare the
> corrupt compressed documents with the working ones and see if there's
> any pattern?
>
> I recommend you assume there's some storage system issue and:
>
> - check dmesg / syslog for disk related errors
> - fsck the filesystem where the couches are
> - if this is a managed / hosted server you might want to get the
> supplier to check if there are any disk / storage issues
> - if it's not virtualised hardware, see if smartmontools tells you
> anything useful
>
> If you wish, you can encrypt files using my public key,
> http://www.apache.org/dist/couchdb/KEYS dch@ apache.org.
>
> A+
> Dave

Re: Corrupted database example file

Posted by Dave Cottlehuber <dc...@jsonified.com>.

On 19 April 2013 00:41, Victor Nicollet <vn...@runorg.com> wrote:
> I searched the logs for any signs of error. The operations performed on the
> prod-folder database in the two hours before the first crash were :
>
> https://gist.github.com/VictorNicollet/878d0176960cc71d9ac1
>
> The compact at 10:54:08 finished without a hitch.
> The compact at 11:54:07 finished with :
>
> https://gist.github.com/VictorNicollet/4d6ccd60bec2ae922a32
>

Hi Victor,

thanks for that information.

Can we get a working copy of the database, so we can compare the
corrupt compressed documents with the working ones and see if there's
any pattern?

I recommend you assume there's some storage system issue and:

- check dmesg / syslog for disk related errors
- fsck the filesystem where the couches are
- if this is a managed / hosted server you might want to get the
supplier to check if there are any disk / storage issues
- if it's not virtualised hardware, see if smartmontools tells you
anything useful

If you wish, you can encrypt files using my public key,
http://www.apache.org/dist/couchdb/KEYS dch@ apache.org.

A+
Dave

Re: Corrupted database example file

Posted by Victor Nicollet <vn...@runorg.com>.

I searched the logs for any signs of error. The operations performed on the
prod-folder database in the two hours before the first crash were :

https://gist.github.com/VictorNicollet/878d0176960cc71d9ac1

The compact at 10:54:08 finished without a hitch.
The compact at 11:54:07 finished with :

https://gist.github.com/VictorNicollet/4d6ccd60bec2ae922a32



On 19 April 2013 00:17, Victor Nicollet <vn...@runorg.com> wrote:

> It had happened once on a critical production database (the user
> database...) so I wrote some code to repair it. And I never throw away any
> code.
>
> If you're interested (but I doubt it : it's pretty useless), I could share
> the repair code.
>
> More info on the logs : apparently, the first compact-related crash
> happened Wed, 17 Apr 2013 11:54:08 GMT : since I have hourly compacts, it
> means the corruption happened Wed, 17 Apr 2013 10:54:08 GMT at the
> earliest. Sifting through that period right now...
>
>
> --
> Victor Nicollet, Directeur Technique, www.runorg.com
>



-- 
Victor Nicollet, Directeur Technique, www.runorg.com

Re: Corrupted database example file

Posted by Ashraf Janan <as...@yahoo.com>.

Thank you very much, Victor. How can I open this folder, and which program must I use?
--- On Fri, 4/19/13, Victor Nicollet <vn...@runorg.com> wrote:

From: Victor Nicollet <vn...@runorg.com>
Subject: Re: Corrupted database example file
To: dev@couchdb.apache.org
Date: Friday, April 19, 2013, 12:17 AM

It had happened once on a critical production database (the user
database...) so I wrote some code to repair it. And I never throw away any
code.

If you're interested (but I doubt it : it's pretty useless), I could share
the repair code.

More info on the logs : apparently, the first compact-related crash
happened Wed, 17 Apr 2013 11:54:08 GMT : since I have hourly compacts, it
means the corruption happened Wed, 17 Apr 2013 10:54:08 GMT at the
earliest. Sifting through that period right now...


-- 
Victor Nicollet, Directeur Technique, www.runorg.com

Re: Corrupted database example file

Posted by Victor Nicollet <vn...@runorg.com>.

It had happened once on a critical production database (the user
database...) so I wrote some code to repair it. And I never throw away any
code.

If you're interested (but I doubt it : it's pretty useless), I could share
the repair code.

More info on the logs : apparently, the first compact-related crash
happened Wed, 17 Apr 2013 11:54:08 GMT : since I have hourly compacts, it
means the corruption happened Wed, 17 Apr 2013 10:54:08 GMT at the
earliest. Sifting through that period right now...
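
The reasoning above can be written out as a small calculation: with hourly compactions, a failure at one compaction means the damage appeared no earlier than the previous one. This is only a sketch of that arithmetic; the function name is hypothetical, and it assumes the previous compaction would itself have failed had the database already been corrupted.

```python
from datetime import datetime, timedelta, timezone


def corruption_window(first_failed_compact, compaction_interval):
    """Return (earliest, latest) bounds for when the corruption appeared."""
    # The previous compaction succeeded, so the data is assumed to have
    # been intact until then.
    earliest = first_failed_compact - compaction_interval
    return earliest, first_failed_compact


# Timestamp of the first compact-related crash, taken from the message.
first_failure = datetime(2013, 4, 17, 11, 54, 8, tzinfo=timezone.utc)
earliest, latest = corruption_window(first_failure, timedelta(hours=1))
print(earliest.isoformat(), "to", latest.isoformat())
```

The resulting one-hour window is the period of the logs worth sifting through first.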


On 19 April 2013 00:13, Robert Newson <rn...@apache.org> wrote:

> You say this happens often? Clearly often enough that you have a
> routine to repair it.
>
> B.
>



-- 
Victor Nicollet, Directeur Technique, www.runorg.com

Re: Corrupted database example file

Posted by Robert Newson <rn...@apache.org>.

You say this happens often? Clearly often enough that you have a
routine to repair it.

B.

On 18 April 2013 23:12, Robert Newson <rn...@apache.org> wrote:
> Hi Victor,
>
> Thanks for the information, we appreciate it.
>
> B.
>

Re: Corrupted database example file

Posted by Robert Newson <rn...@apache.org>.

Hi Victor,

Thanks for the information, we appreciate it.

B.

On 18 April 2013 23:07, Victor Nicollet <vn...@runorg.com> wrote:
> Replying to my own mail, hoping it will end up in the same thread (I was
> not fully subscribed when I posted this, but I still read the archives).
>
> Answers to the questions you asked :
>
>  - I have no idea when the issue happened. I will try to track it down in
> the logs. I'm afraid I don't have time to filter out all customer
> information from the logs and provide them to you, though I can certainly
> grep for error dumps if you want me to. I have never seen disk-related
> errors in the log.
>  - I am running Debian x86_64 GNU/Linux, with erlang 1:15.b.1-d
>  - There are no unusual CouchDB configuration options ; the only change I
> performed was to disable reduce_limit. A perhaps notable usage aspect : all
> the databases are compacted hourly.
>  - It's not NFS. From /etc/fstab :
>
> /dev/sda1       /       ext4    errors=remount-ro       0       1
> /dev/sda2       /home   ext4    defaults                0       2
>
> The dual-partition setup is a silly default from OVH (my dedicated server
> host), so I have /var/lib/couchdb as a symlink to /home/couchdb/lib, from
> sda1 to sda2.
>
> - I can't rule out a disk issue, because I don't have a lot of experience
> with those... any obvious diagnosis command you would like me to run ? I am
> certain that I have not run out of disk space, though (still around 1TB
> free on that drive).
>
> Thank you for your patience.
>

Re: Corrupted database example file

Posted by Victor Nicollet <vn...@runorg.com>.

Replying to my own mail, hoping it will end up in the same thread (I was
not fully subscribed when I posted this, but I still read the archives).

Answers to the questions you asked :

 - I have no idea when the issue happened. I will try to track it down in
the logs. I'm afraid I don't have time to filter out all customer
information from the logs and provide them to you, though I can certainly
grep for error dumps if you want me to. I have never seen disk-related
errors in the log.
 - I am running Debian x86_64 GNU/Linux, with erlang 1:15.b.1-d
 - There are no unusual CouchDB configuration options ; the only change I
performed was to disable reduce_limit. A perhaps notable usage aspect : all
the databases are compacted hourly.
 - It's not NFS. From /etc/fstab :

/dev/sda1       /       ext4    errors=remount-ro       0       1
/dev/sda2       /home   ext4    defaults                0       2

The dual-partition setup is a silly default from OVH (my dedicated server
host), so I have /var/lib/couchdb as a symlink to /home/couchdb/lib, from
sda1 to sda2.
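
For readers who want to repeat the "is it NFS?" check, here is a minimal sketch (not from the original message). It assumes fstab-style text with the device, mount point, and filesystem type in the first three columns, as in the two lines quoted above. The helper names `parse_mounts` and `filesystem_for` are hypothetical, and the symlink target is taken from the message rather than resolved on a real machine.

```python
# The two /etc/fstab lines quoted in the message above.
FSTAB = """\
/dev/sda1       /       ext4    errors=remount-ro       0       1
/dev/sda2       /home   ext4    defaults                0       2
"""

def parse_mounts(text):
    """Return (device, mount_point, fs_type) tuples from fstab-style text."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 3:
            entries.append((fields[0], fields[1], fields[2]))
    return entries

def filesystem_for(path, entries):
    """Pick the entry whose mount point is the longest prefix of path."""
    best = None
    for device, mount_point, fs_type in entries:
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or (path + "/").startswith(prefix):
            if best is None or len(mount_point) > len(best[1]):
                best = (device, mount_point, fs_type)
    return best

# On a real server the symlink would be resolved first, for example with
# os.path.realpath("/var/lib/couchdb"); here the target comes from the message.
target = "/home/couchdb/lib"
print(filesystem_for(target, parse_mounts(FSTAB)))
# ('/dev/sda2', '/home', 'ext4')
```

With the quoted fstab, the CouchDB data directory resolves to the ext4 partition on /dev/sda2, which is consistent with the statement that NFS is not involved.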

- I can't rule out a disk issue, because I don't have a lot of experience
with those... any obvious diagnosis command you would like me to run ? I am
certain that I have not run out of disk space, though (still around 1TB
free on that drive).

Thank you for your patience.

On 18 April 2013 14:17, Victor Nicollet <vn...@runorg.com> wrote:

> Hello,
>
> The @CouchDB twitter account thought you might find this information
> helpful.
>
> My SaaS start-up uses CouchDB as its primary database. Lately, I have been
> having database corruption issues with version 1.2.0 : every few weeks, one
> of our databases becomes corrupted, which has several negative consequences
> (among others) :
>
>    - Replication of that database fails (it does not even start).
>    - Compaction of that database fails and *freezes* the server.
>    - Several documents in the database become inaccessible through either
>    direct access or through _all_docs.
>
>  The latest affected database does not contain any information about our
> customers, so I am allowed to release it publicly :
>
> http://nicollet.net/public/2013-04-18.couchdb/prod-folder.couch
>
> This database contains 325 irretrievable documents between identifiers
> 2xFEY0pU2Eb and 3Fn6l04G6Oa.
> I hope this helps,
>
> --
> Victor Nicollet, CTO, www.runorg.com
>



-- 
Victor Nicollet, Directeur Technique, www.runorg.com

Re: Corrupted database example file

Posted by Dave Cottlehuber <dc...@jsonified.com>.

On 18 April 2013 14:17, Victor Nicollet <vn...@runorg.com> wrote:
> Hello,
>
> The @CouchDB twitter account thought you might find this information
> helpful.
>
> My SaaS start-up uses CouchDB as its primary database. Lately, I have been
> having database corruption issues with version 1.2.0 : every few weeks, one
> of our databases becomes corrupted, which has several negative consequences
> (among others) :
>
>    - Replication of that database fails (it does not even start).
>    - Compaction of that database fails and *freezes* the server.
>    - Several documents in the database become inaccessible through either
>    direct access or through _all_docs.
>
>  The latest affected database does not contain any information about our
> customers, so I am allowed to release it publicly :
>
> http://nicollet.net/public/2013-04-18.couchdb/prod-folder.couch
>
> This database contains 325 irretrievable documents between identifiers
> 2xFEY0pU2Eb and 3Fn6l04G6Oa.
> I hope this helps,
>
> --
> Victor Nicollet, CTO, www.runorg.com


Salut Victor,

Thanks for reporting this.

- what erlang release are you running, and on what OS?
- are there any disk-related messages in the logfiles?
- can you make more of the couchdb.log available to us, even privately?
- any additional build information?
- any special configuration?

A+
Dave

Re: Corrupted database example file

Posted by Ashraf <as...@yahoo.com>.

Thanks,
But what must I do first?
I want to build a database via Objective-C. I don't know how I can use the HTTP POST method to send data to CouchDB.

Sent from my iPhone

On 18 Apr 2013 at 19:49, Robert Newson <rn...@apache.org> wrote:

> Hi Ashraf,
> 
> You've replied to an existing thread about a different topic, we'll
> respond to your query when you start a new thread.
> 
> B.
> 
> On 18 April 2013 18:47, Ashraf <as...@yahoo.com> wrote:
>> Hello all!
>> Please, how can I send data via the POST method with NSURLConnection in Objective-C?
>> 
>> 
>> Sent from my iPhone
>> 
>> On 18 Apr 2013 at 18:47, Paul Davis <pa...@gmail.com> wrote:
>> 
>>> Investigating this locally. Reproduced in that it fails locally. I've
>>> narrowed down the exact binary that's failing snappy decompression
>>> fairly easily. So far my debugging is showing that snappy is trying to
>>> decode this binary into a format that ends up exceeding the
>>> decompressed size that it thought it was going to need which suggests
>>> some sort of corruption within this binary. Still trying to figure out
>>> where the exact bit of corruption is.
>>> 
>>> In case anyone is interested, here's the binary:
>>> 
>>> <<148,8,232,131,104,2,100,0,7,107,112,95,110,111,100,101,108,0,0,0,
>>> 25,104,2,109,0,0,0,11,50,122,119,82,121,49,90,66,49,118,107,104,3,
>>> 98,0,8,232,117,104,3,97,13,97,0,98,0,0,8,98,98,0,0,3,36,13,41,44,
>>> 51,49,50,81,113,48,114,81,56,84,78,104,1,41,20,235,153,104,3,97,
>>> 12,5,41,24,7,54,98,0,0,3,28,17,41,36,51,50,86,110,48,77,121,55,75,
>>> 82,5,41,4,238,181,21,41,4,6,158,1,41,0,1,29,41,32,56,99,70,48,98,
>>> 111,55,75,81,5,41,4,241,182,25,41,20,137,98,0,0,2,221,21,41,32,
>>> 114,71,108,48,53,111,55,109,87,5,41,4,244,147,21,164,4,7,9,1,41,0,
>>> 211,17,41,36,52,114,69,85,48,50,49,55,67,109,5,41,4,247,102,25,41,
>>> 0,25,1,123,0,5,17,41,36,53,68,108,48,48,115,120,55,67,100,5,41,4,
>>> 250,107,25,41,0,18,1,82,0,240,21,41,20,101,89,79,49,78,113,17,82,
>>> 20,253,91,104,3,97,14,9,246,5,164,0,244,21,41,20,104,116,65,49,98,
>>> 90,13,82,8,9,0,79,21,82,4,6,254,1,82,0,237,17,41,24,54,51,110,104,
>>> 50,80,100,13,82,8,9,3,60,21,41,4,7,30,1,164,0,12,21,41,20,78,120,
>>> 81,50,109,52,17,82,4,6,72,21,41,9,82,0,222,17,41,36,55,56,53,110,
>>> 48,68,107,53,102,76,1,205,8,9,9,38,57,72,0,143,1,123,0,241,17,41,
>>> 36,56,69,53,52,48,107,55,54,70,52,5,41,4,12,23,21,82,4,7,27,1,41,
>>> 0,224,25,41,28,115,117,48,51,79,55,76,102,5,41,20,14,247,104,3,97,
>>> 15,9,246,0,235,1,41,0,185,21,41,32,89,67,57,48,55,119,48,103,68,5,
>>> 41,4,17,176,25,82,0,15,1,41,0,208,21,41,32,114,106,117,48,90,97,
>>> 48,103,66,5,41,4,20,128,25,164,0,139,1,41,0,218,17,41,32,57,97,73,
>>> 90,48,52,52,54,97,69,62,8,9,23,90,25,82,0,12,1,41,21,82,32,65,104,
>>> 121,78,49,98,70,54,98,37,236,8,9,26,42,25,41,0,21,1,41,53,154,64,
>>> 66,48,117,98,48,56,70,55,50,73,104,3,98,0,9,29,30,25,123,0,153,1,
>>> 41,0,240,17,123,32,66,51,76,51,48,76,90,55,51,9,82,4,32,14,25,82,
>>> 0,5,1,41,0,210,21,41,20,76,101,121,48,115,50,17,82,4,34,224,25,82,
>>> 37,113,0,215,17,41,36,67,122,87,103,48,84,117,52,107,119,5,123,8,
>>> 37,183,104,85,62,0,115,50,82,0,36,69,77,55,104,48,76,108,52,67,
>>> 105,5,41,4,40,137,25,82,5,164,0,245,17,82,32,69,78,68,98,48,51,81,
>>> 48,121,41,72,4,43,126,57,154,0,233,1,82,53,154,36,70,110,54,108,
>>> 48,52,71,54,79,90,5,82,4,46,55,25,123,24,145,98,0,0,2,245,106>>
>>> 
>>> On Thu, Apr 18, 2013 at 9:03 AM, Benoit Chesneau <bc...@gmail.com> wrote:
>>>> On Thu, Apr 18, 2013 at 3:05 PM, Robert Newson <rn...@apache.org> wrote:
>>>>> Hi Victor,
>>>>> 
>>>>> Thanks for the report and for capturing a .couch file in this state. I
>>>>> have reproduced the error locally with 1.2.0.
>>>>> 
>>>>> Can you tell me anything about what was happening before this
>>>>> happened? Did you ever run out of disk space or have other disk
>>>>> issues? What operating system? what version of erlang? which
>>>>> filesystem? which mount options?
>>>> also what was the load at that point (CPU and such). Do you have any
>>>> other error in logs before that?
>>>> 
>>>> - benoit

Re: Corrupted database example file

Posted by Robert Newson <rn...@apache.org>.

Ashraf,

Please stop hijacking threads and start a new one. We will not respond
with assistance until you do so.

B.

On 18 April 2013 19:09, Ashraf <as...@yahoo.com> wrote:
> Thanks,
> But what must I do first?
> I want to build a database via Objective-C. I don't know how I can use the HTTP POST method to send data to CouchDB.
>
> Sent from my iPhone
>
> On 18 Apr 2013 at 19:49, Robert Newson <rn...@apache.org> wrote:
>
>> Hi Ashraf,
>>
>> You've replied to an existing thread about a different topic, we'll
>> respond to your query when you start a new thread.
>>
>> B.
>>
>> On 18 April 2013 18:47, Ashraf <as...@yahoo.com> wrote:
>>> Hello all!
>>> Please, how can I send data via the POST method with NSURLConnection in Objective-C?
>>>
>>>
>>> Sent from my iPhone
>>>
>>> On 18 Apr 2013 at 18:47, Paul Davis <pa...@gmail.com> wrote:
>>>
>>>> Investigating this locally. Reproduced in that it fails locally. I've
>>>> narrowed down the exact binary that's failing snappy decompression
>>>> fairly easily. So far my debugging is showing that snappy is trying to
>>>> decode this binary into a format that ends up exceeding the
>>>> decompressed size that it thought it was going to need which suggests
>>>> some sort of corruption within this binary. Still trying to figure out
>>>> where the exact bit of corruption is.
>>>>
>>>> In case anyone is interested, here's the binary:
>>>>
>>>> <<148,8,232,131,104,2,100,0,7,107,112,95,110,111,100,101,108,0,0,0,
>>>> 25,104,2,109,0,0,0,11,50,122,119,82,121,49,90,66,49,118,107,104,3,
>>>> 98,0,8,232,117,104,3,97,13,97,0,98,0,0,8,98,98,0,0,3,36,13,41,44,
>>>> 51,49,50,81,113,48,114,81,56,84,78,104,1,41,20,235,153,104,3,97,
>>>> 12,5,41,24,7,54,98,0,0,3,28,17,41,36,51,50,86,110,48,77,121,55,75,
>>>> 82,5,41,4,238,181,21,41,4,6,158,1,41,0,1,29,41,32,56,99,70,48,98,
>>>> 111,55,75,81,5,41,4,241,182,25,41,20,137,98,0,0,2,221,21,41,32,
>>>> 114,71,108,48,53,111,55,109,87,5,41,4,244,147,21,164,4,7,9,1,41,0,
>>>> 211,17,41,36,52,114,69,85,48,50,49,55,67,109,5,41,4,247,102,25,41,
>>>> 0,25,1,123,0,5,17,41,36,53,68,108,48,48,115,120,55,67,100,5,41,4,
>>>> 250,107,25,41,0,18,1,82,0,240,21,41,20,101,89,79,49,78,113,17,82,
>>>> 20,253,91,104,3,97,14,9,246,5,164,0,244,21,41,20,104,116,65,49,98,
>>>> 90,13,82,8,9,0,79,21,82,4,6,254,1,82,0,237,17,41,24,54,51,110,104,
>>>> 50,80,100,13,82,8,9,3,60,21,41,4,7,30,1,164,0,12,21,41,20,78,120,
>>>> 81,50,109,52,17,82,4,6,72,21,41,9,82,0,222,17,41,36,55,56,53,110,
>>>> 48,68,107,53,102,76,1,205,8,9,9,38,57,72,0,143,1,123,0,241,17,41,
>>>> 36,56,69,53,52,48,107,55,54,70,52,5,41,4,12,23,21,82,4,7,27,1,41,
>>>> 0,224,25,41,28,115,117,48,51,79,55,76,102,5,41,20,14,247,104,3,97,
>>>> 15,9,246,0,235,1,41,0,185,21,41,32,89,67,57,48,55,119,48,103,68,5,
>>>> 41,4,17,176,25,82,0,15,1,41,0,208,21,41,32,114,106,117,48,90,97,
>>>> 48,103,66,5,41,4,20,128,25,164,0,139,1,41,0,218,17,41,32,57,97,73,
>>>> 90,48,52,52,54,97,69,62,8,9,23,90,25,82,0,12,1,41,21,82,32,65,104,
>>>> 121,78,49,98,70,54,98,37,236,8,9,26,42,25,41,0,21,1,41,53,154,64,
>>>> 66,48,117,98,48,56,70,55,50,73,104,3,98,0,9,29,30,25,123,0,153,1,
>>>> 41,0,240,17,123,32,66,51,76,51,48,76,90,55,51,9,82,4,32,14,25,82,
>>>> 0,5,1,41,0,210,21,41,20,76,101,121,48,115,50,17,82,4,34,224,25,82,
>>>> 37,113,0,215,17,41,36,67,122,87,103,48,84,117,52,107,119,5,123,8,
>>>> 37,183,104,85,62,0,115,50,82,0,36,69,77,55,104,48,76,108,52,67,
>>>> 105,5,41,4,40,137,25,82,5,164,0,245,17,82,32,69,78,68,98,48,51,81,
>>>> 48,121,41,72,4,43,126,57,154,0,233,1,82,53,154,36,70,110,54,108,
>>>> 48,52,71,54,79,90,5,82,4,46,55,25,123,24,145,98,0,0,2,245,106>>
>>>>
>>>> On Thu, Apr 18, 2013 at 9:03 AM, Benoit Chesneau <bc...@gmail.com> wrote:
>>>>> On Thu, Apr 18, 2013 at 3:05 PM, Robert Newson <rn...@apache.org> wrote:
>>>>>> Hi Victor,
>>>>>>
>>>>>> Thanks for the report and for capturing a .couch file in this state. I
>>>>>> have reproduced the error locally with 1.2.0.
>>>>>>
>>>>>> Can you tell me anything about what was happening before this
>>>>>> happened? Did you ever run out of disk space or have other disk
>>>>>> issues? What operating system? what version of erlang? which
>>>>>> filesystem? which mount options?
>>>>> also what was the load at that point (CPU and such). Do you have any
>>>>> other error in logs before that?
>>>>>
>>>>> - benoit

Re: Corrupted database example file

Posted by Ashraf <as...@yahoo.com>.

Thanks,
But what must I do first?
I want to build a database via Objective-C. I don't know how I can use the HTTP POST method to send data to CouchDB.

Sent from my iPhone

On 18 Apr 2013 at 19:49, Robert Newson <rn...@apache.org> wrote:

> Hi Ashraf,
> 
> You've replied to an existing thread about a different topic, we'll
> respond to your query when you start a new thread.
> 
> B.
> 
> On 18 April 2013 18:47, Ashraf <as...@yahoo.com> wrote:
>> Hello all!
>> Please, how can I send data via the POST method with NSURLConnection in Objective-C?
>> 
>> 
>> Sent from my iPhone
>> 
>> On 18 Apr 2013 at 18:47, Paul Davis <pa...@gmail.com> wrote:
>> 
>>> Investigating this locally. Reproduced in that it fails locally. I've
>>> narrowed down the exact binary that's failing snappy decompression
>>> fairly easily. So far my debugging is showing that snappy is trying to
>>> decode this binary into a format that ends up exceeding the
>>> decompressed size that it thought it was going to need which suggests
>>> some sort of corruption within this binary. Still trying to figure out
>>> where the exact bit of corruption is.
>>> 
>>> In case anyone is interested, here's the binary:
>>> 
>>> <<148,8,232,131,104,2,100,0,7,107,112,95,110,111,100,101,108,0,0,0,
>>> 25,104,2,109,0,0,0,11,50,122,119,82,121,49,90,66,49,118,107,104,3,
>>> 98,0,8,232,117,104,3,97,13,97,0,98,0,0,8,98,98,0,0,3,36,13,41,44,
>>> 51,49,50,81,113,48,114,81,56,84,78,104,1,41,20,235,153,104,3,97,
>>> 12,5,41,24,7,54,98,0,0,3,28,17,41,36,51,50,86,110,48,77,121,55,75,
>>> 82,5,41,4,238,181,21,41,4,6,158,1,41,0,1,29,41,32,56,99,70,48,98,
>>> 111,55,75,81,5,41,4,241,182,25,41,20,137,98,0,0,2,221,21,41,32,
>>> 114,71,108,48,53,111,55,109,87,5,41,4,244,147,21,164,4,7,9,1,41,0,
>>> 211,17,41,36,52,114,69,85,48,50,49,55,67,109,5,41,4,247,102,25,41,
>>> 0,25,1,123,0,5,17,41,36,53,68,108,48,48,115,120,55,67,100,5,41,4,
>>> 250,107,25,41,0,18,1,82,0,240,21,41,20,101,89,79,49,78,113,17,82,
>>> 20,253,91,104,3,97,14,9,246,5,164,0,244,21,41,20,104,116,65,49,98,
>>> 90,13,82,8,9,0,79,21,82,4,6,254,1,82,0,237,17,41,24,54,51,110,104,
>>> 50,80,100,13,82,8,9,3,60,21,41,4,7,30,1,164,0,12,21,41,20,78,120,
>>> 81,50,109,52,17,82,4,6,72,21,41,9,82,0,222,17,41,36,55,56,53,110,
>>> 48,68,107,53,102,76,1,205,8,9,9,38,57,72,0,143,1,123,0,241,17,41,
>>> 36,56,69,53,52,48,107,55,54,70,52,5,41,4,12,23,21,82,4,7,27,1,41,
>>> 0,224,25,41,28,115,117,48,51,79,55,76,102,5,41,20,14,247,104,3,97,
>>> 15,9,246,0,235,1,41,0,185,21,41,32,89,67,57,48,55,119,48,103,68,5,
>>> 41,4,17,176,25,82,0,15,1,41,0,208,21,41,32,114,106,117,48,90,97,
>>> 48,103,66,5,41,4,20,128,25,164,0,139,1,41,0,218,17,41,32,57,97,73,
>>> 90,48,52,52,54,97,69,62,8,9,23,90,25,82,0,12,1,41,21,82,32,65,104,
>>> 121,78,49,98,70,54,98,37,236,8,9,26,42,25,41,0,21,1,41,53,154,64,
>>> 66,48,117,98,48,56,70,55,50,73,104,3,98,0,9,29,30,25,123,0,153,1,
>>> 41,0,240,17,123,32,66,51,76,51,48,76,90,55,51,9,82,4,32,14,25,82,
>>> 0,5,1,41,0,210,21,41,20,76,101,121,48,115,50,17,82,4,34,224,25,82,
>>> 37,113,0,215,17,41,36,67,122,87,103,48,84,117,52,107,119,5,123,8,
>>> 37,183,104,85,62,0,115,50,82,0,36,69,77,55,104,48,76,108,52,67,
>>> 105,5,41,4,40,137,25,82,5,164,0,245,17,82,32,69,78,68,98,48,51,81,
>>> 48,121,41,72,4,43,126,57,154,0,233,1,82,53,154,36,70,110,54,108,
>>> 48,52,71,54,79,90,5,82,4,46,55,25,123,24,145,98,0,0,2,245,106>>
>>> 
>>> On Thu, Apr 18, 2013 at 9:03 AM, Benoit Chesneau <bc...@gmail.com> wrote:
>>>> On Thu, Apr 18, 2013 at 3:05 PM, Robert Newson <rn...@apache.org> wrote:
>>>>> Hi Victor,
>>>>> 
>>>>> Thanks for the report and for capturing a .couch file in this state. I
>>>>> have reproduced the error locally with 1.2.0.
>>>>> 
>>>>> Can you tell me anything about what was happening before this
>>>>> happened? Did you ever run out of disk space or have other disk
>>>>> issues? What operating system? what version of erlang? which
>>>>> filesystem? which mount options?
>>>> also what was the load at that point (CPU and such). Do you have any
>>>> other error in logs before that?
>>>> 
>>>> - benoit

Re: Corrupted database example file

Posted by Robert Newson <rn...@apache.org>.

Hi Ashraf,

You've replied to an existing thread about a different topic, we'll
respond to your query when you start a new thread.

B.

On 18 April 2013 18:47, Ashraf <as...@yahoo.com> wrote:
> Hello all!
> Please, how can I send data via the POST method with NSURLConnection in Objective-C?
>
>
> Sent from my iPhone
>
> On 18 Apr 2013 at 18:47, Paul Davis <pa...@gmail.com> wrote:
>
>> Investigating this locally. Reproduced in that it fails locally. I've
>> narrowed down the exact binary that's failing snappy decompression
>> fairly easily. So far my debugging is showing that snappy is trying to
>> decode this binary into a format that ends up exceeding the
>> decompressed size that it thought it was going to need which suggests
>> some sort of corruption within this binary. Still trying to figure out
>> where the exact bit of corruption is.
>>
>> In case anyone is interested, here's the binary:
>>
>> <<148,8,232,131,104,2,100,0,7,107,112,95,110,111,100,101,108,0,0,0,
>> 25,104,2,109,0,0,0,11,50,122,119,82,121,49,90,66,49,118,107,104,3,
>> 98,0,8,232,117,104,3,97,13,97,0,98,0,0,8,98,98,0,0,3,36,13,41,44,
>> 51,49,50,81,113,48,114,81,56,84,78,104,1,41,20,235,153,104,3,97,
>> 12,5,41,24,7,54,98,0,0,3,28,17,41,36,51,50,86,110,48,77,121,55,75,
>> 82,5,41,4,238,181,21,41,4,6,158,1,41,0,1,29,41,32,56,99,70,48,98,
>> 111,55,75,81,5,41,4,241,182,25,41,20,137,98,0,0,2,221,21,41,32,
>> 114,71,108,48,53,111,55,109,87,5,41,4,244,147,21,164,4,7,9,1,41,0,
>> 211,17,41,36,52,114,69,85,48,50,49,55,67,109,5,41,4,247,102,25,41,
>> 0,25,1,123,0,5,17,41,36,53,68,108,48,48,115,120,55,67,100,5,41,4,
>> 250,107,25,41,0,18,1,82,0,240,21,41,20,101,89,79,49,78,113,17,82,
>> 20,253,91,104,3,97,14,9,246,5,164,0,244,21,41,20,104,116,65,49,98,
>> 90,13,82,8,9,0,79,21,82,4,6,254,1,82,0,237,17,41,24,54,51,110,104,
>> 50,80,100,13,82,8,9,3,60,21,41,4,7,30,1,164,0,12,21,41,20,78,120,
>> 81,50,109,52,17,82,4,6,72,21,41,9,82,0,222,17,41,36,55,56,53,110,
>> 48,68,107,53,102,76,1,205,8,9,9,38,57,72,0,143,1,123,0,241,17,41,
>> 36,56,69,53,52,48,107,55,54,70,52,5,41,4,12,23,21,82,4,7,27,1,41,
>> 0,224,25,41,28,115,117,48,51,79,55,76,102,5,41,20,14,247,104,3,97,
>> 15,9,246,0,235,1,41,0,185,21,41,32,89,67,57,48,55,119,48,103,68,5,
>> 41,4,17,176,25,82,0,15,1,41,0,208,21,41,32,114,106,117,48,90,97,
>> 48,103,66,5,41,4,20,128,25,164,0,139,1,41,0,218,17,41,32,57,97,73,
>> 90,48,52,52,54,97,69,62,8,9,23,90,25,82,0,12,1,41,21,82,32,65,104,
>> 121,78,49,98,70,54,98,37,236,8,9,26,42,25,41,0,21,1,41,53,154,64,
>> 66,48,117,98,48,56,70,55,50,73,104,3,98,0,9,29,30,25,123,0,153,1,
>> 41,0,240,17,123,32,66,51,76,51,48,76,90,55,51,9,82,4,32,14,25,82,
>> 0,5,1,41,0,210,21,41,20,76,101,121,48,115,50,17,82,4,34,224,25,82,
>> 37,113,0,215,17,41,36,67,122,87,103,48,84,117,52,107,119,5,123,8,
>> 37,183,104,85,62,0,115,50,82,0,36,69,77,55,104,48,76,108,52,67,
>> 105,5,41,4,40,137,25,82,5,164,0,245,17,82,32,69,78,68,98,48,51,81,
>> 48,121,41,72,4,43,126,57,154,0,233,1,82,53,154,36,70,110,54,108,
>> 48,52,71,54,79,90,5,82,4,46,55,25,123,24,145,98,0,0,2,245,106>>
>>
>> On Thu, Apr 18, 2013 at 9:03 AM, Benoit Chesneau <bc...@gmail.com> wrote:
>>> On Thu, Apr 18, 2013 at 3:05 PM, Robert Newson <rn...@apache.org> wrote:
>>>> Hi Victor,
>>>>
>>>> Thanks for the report and for capturing a .couch file in this state. I
>>>> have reproduced the error locally with 1.2.0.
>>>>
>>>> Can you tell me anything about what was happening before this
>>>> happened? Did you ever run out of disk space or have other disk
>>>> issues? What operating system? what version of erlang? which
>>>> filesystem? which mount options?
>>> also what was the load at that point (CPU and such). Do you have any
>>> other error in logs before that?
>>>
>>> - benoit

Re: Corrupted database example file

Posted by Ashraf <as...@yahoo.com>.

Hello all!
Please, how can I send data via the POST method with NSURLConnection in Objective-C?


Sent from my iPhone

On 18 Apr 2013 at 18:47, Paul Davis <pa...@gmail.com> wrote:

> Investigating this locally. Reproduced in that it fails locally. I've
> narrowed down the exact binary that's failing snappy decompression
> fairly easily. So far my debugging is showing that snappy is trying to
> decode this binary into a format that ends up exceeding the
> decompressed size that it thought it was going to need which suggests
> some sort of corruption within this binary. Still trying to figure out
> where the exact bit of corruption is.
> 
> In case anyone is interested, here's the binary:
> 
> <<148,8,232,131,104,2,100,0,7,107,112,95,110,111,100,101,108,0,0,0,
> 25,104,2,109,0,0,0,11,50,122,119,82,121,49,90,66,49,118,107,104,3,
> 98,0,8,232,117,104,3,97,13,97,0,98,0,0,8,98,98,0,0,3,36,13,41,44,
> 51,49,50,81,113,48,114,81,56,84,78,104,1,41,20,235,153,104,3,97,
> 12,5,41,24,7,54,98,0,0,3,28,17,41,36,51,50,86,110,48,77,121,55,75,
> 82,5,41,4,238,181,21,41,4,6,158,1,41,0,1,29,41,32,56,99,70,48,98,
> 111,55,75,81,5,41,4,241,182,25,41,20,137,98,0,0,2,221,21,41,32,
> 114,71,108,48,53,111,55,109,87,5,41,4,244,147,21,164,4,7,9,1,41,0,
> 211,17,41,36,52,114,69,85,48,50,49,55,67,109,5,41,4,247,102,25,41,
> 0,25,1,123,0,5,17,41,36,53,68,108,48,48,115,120,55,67,100,5,41,4,
> 250,107,25,41,0,18,1,82,0,240,21,41,20,101,89,79,49,78,113,17,82,
> 20,253,91,104,3,97,14,9,246,5,164,0,244,21,41,20,104,116,65,49,98,
> 90,13,82,8,9,0,79,21,82,4,6,254,1,82,0,237,17,41,24,54,51,110,104,
> 50,80,100,13,82,8,9,3,60,21,41,4,7,30,1,164,0,12,21,41,20,78,120,
> 81,50,109,52,17,82,4,6,72,21,41,9,82,0,222,17,41,36,55,56,53,110,
> 48,68,107,53,102,76,1,205,8,9,9,38,57,72,0,143,1,123,0,241,17,41,
> 36,56,69,53,52,48,107,55,54,70,52,5,41,4,12,23,21,82,4,7,27,1,41,
> 0,224,25,41,28,115,117,48,51,79,55,76,102,5,41,20,14,247,104,3,97,
> 15,9,246,0,235,1,41,0,185,21,41,32,89,67,57,48,55,119,48,103,68,5,
> 41,4,17,176,25,82,0,15,1,41,0,208,21,41,32,114,106,117,48,90,97,
> 48,103,66,5,41,4,20,128,25,164,0,139,1,41,0,218,17,41,32,57,97,73,
> 90,48,52,52,54,97,69,62,8,9,23,90,25,82,0,12,1,41,21,82,32,65,104,
> 121,78,49,98,70,54,98,37,236,8,9,26,42,25,41,0,21,1,41,53,154,64,
> 66,48,117,98,48,56,70,55,50,73,104,3,98,0,9,29,30,25,123,0,153,1,
> 41,0,240,17,123,32,66,51,76,51,48,76,90,55,51,9,82,4,32,14,25,82,
> 0,5,1,41,0,210,21,41,20,76,101,121,48,115,50,17,82,4,34,224,25,82,
> 37,113,0,215,17,41,36,67,122,87,103,48,84,117,52,107,119,5,123,8,
> 37,183,104,85,62,0,115,50,82,0,36,69,77,55,104,48,76,108,52,67,
> 105,5,41,4,40,137,25,82,5,164,0,245,17,82,32,69,78,68,98,48,51,81,
> 48,121,41,72,4,43,126,57,154,0,233,1,82,53,154,36,70,110,54,108,
> 48,52,71,54,79,90,5,82,4,46,55,25,123,24,145,98,0,0,2,245,106>>
> 
> On Thu, Apr 18, 2013 at 9:03 AM, Benoit Chesneau <bc...@gmail.com> wrote:
>> On Thu, Apr 18, 2013 at 3:05 PM, Robert Newson <rn...@apache.org> wrote:
>>> Hi Victor,
>>> 
>>> Thanks for the report and for capturing a .couch file in this state. I
>>> have reproduced the error locally with 1.2.0.
>>> 
>>> Can you tell me anything about what was happening before this
>>> happened? Did you ever run out of disk space or have other disk
>>> issues? What operating system? what version of erlang? which
>>> filesystem? which mount options?
>> also what was the load at that point (CPU and such). Do you have any
>> other error in logs before that?
>> 
>> - benoit

Re: Corrupted database example file

Posted by Paul Davis <pa...@gmail.com>.

Investigating this locally. Reproduced in that it fails locally. I've
narrowed down the exact binary that's failing snappy decompression
fairly easily. So far my debugging is showing that snappy is trying to
decode this binary into a format that ends up exceeding the
decompressed size that it thought it was going to need which suggests
some sort of corruption within this binary. Still trying to figure out
where the exact bit of corruption is.

In case anyone is interested, here's the binary:

<<148,8,232,131,104,2,100,0,7,107,112,95,110,111,100,101,108,0,0,0,
 25,104,2,109,0,0,0,11,50,122,119,82,121,49,90,66,49,118,107,104,3,
 98,0,8,232,117,104,3,97,13,97,0,98,0,0,8,98,98,0,0,3,36,13,41,44,
 51,49,50,81,113,48,114,81,56,84,78,104,1,41,20,235,153,104,3,97,
 12,5,41,24,7,54,98,0,0,3,28,17,41,36,51,50,86,110,48,77,121,55,75,
 82,5,41,4,238,181,21,41,4,6,158,1,41,0,1,29,41,32,56,99,70,48,98,
 111,55,75,81,5,41,4,241,182,25,41,20,137,98,0,0,2,221,21,41,32,
 114,71,108,48,53,111,55,109,87,5,41,4,244,147,21,164,4,7,9,1,41,0,
 211,17,41,36,52,114,69,85,48,50,49,55,67,109,5,41,4,247,102,25,41,
 0,25,1,123,0,5,17,41,36,53,68,108,48,48,115,120,55,67,100,5,41,4,
 250,107,25,41,0,18,1,82,0,240,21,41,20,101,89,79,49,78,113,17,82,
 20,253,91,104,3,97,14,9,246,5,164,0,244,21,41,20,104,116,65,49,98,
 90,13,82,8,9,0,79,21,82,4,6,254,1,82,0,237,17,41,24,54,51,110,104,
 50,80,100,13,82,8,9,3,60,21,41,4,7,30,1,164,0,12,21,41,20,78,120,
 81,50,109,52,17,82,4,6,72,21,41,9,82,0,222,17,41,36,55,56,53,110,
 48,68,107,53,102,76,1,205,8,9,9,38,57,72,0,143,1,123,0,241,17,41,
 36,56,69,53,52,48,107,55,54,70,52,5,41,4,12,23,21,82,4,7,27,1,41,
 0,224,25,41,28,115,117,48,51,79,55,76,102,5,41,20,14,247,104,3,97,
 15,9,246,0,235,1,41,0,185,21,41,32,89,67,57,48,55,119,48,103,68,5,
 41,4,17,176,25,82,0,15,1,41,0,208,21,41,32,114,106,117,48,90,97,
 48,103,66,5,41,4,20,128,25,164,0,139,1,41,0,218,17,41,32,57,97,73,
 90,48,52,52,54,97,69,62,8,9,23,90,25,82,0,12,1,41,21,82,32,65,104,
 121,78,49,98,70,54,98,37,236,8,9,26,42,25,41,0,21,1,41,53,154,64,
 66,48,117,98,48,56,70,55,50,73,104,3,98,0,9,29,30,25,123,0,153,1,
 41,0,240,17,123,32,66,51,76,51,48,76,90,55,51,9,82,4,32,14,25,82,
 0,5,1,41,0,210,21,41,20,76,101,121,48,115,50,17,82,4,34,224,25,82,
 37,113,0,215,17,41,36,67,122,87,103,48,84,117,52,107,119,5,123,8,
 37,183,104,85,62,0,115,50,82,0,36,69,77,55,104,48,76,108,52,67,
 105,5,41,4,40,137,25,82,5,164,0,245,17,82,32,69,78,68,98,48,51,81,
 48,121,41,72,4,43,126,57,154,0,233,1,82,53,154,36,70,110,54,108,
 48,52,71,54,79,90,5,82,4,46,55,25,123,24,145,98,0,0,2,245,106>>
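
As an aside for anyone inspecting the binary above, here is a minimal sketch (not from the original message) that decodes only its snappy header. It relies on the published snappy block format: a little-endian base-128 varint giving the uncompressed length, followed by tagged elements whose low two bits are 00 for a literal. The function names are hypothetical, and the sketch does not attempt to find the corrupted bytes.

```python
# First bytes of the binary posted above (only the start is needed here).
HEAD = bytes([148, 8, 232, 131, 104, 2, 100, 0, 7,
              107, 112, 95, 110, 111, 100, 101])

def read_varint(data, pos=0):
    """Decode a little-endian base-128 varint; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, pos
        shift += 7

def read_literal_tag(data, pos):
    """Decode a snappy literal tag whose length is stored inline (under 61 bytes)."""
    tag = data[pos]
    if tag & 0x03 != 0:
        raise ValueError("not a literal element")
    length_minus_one = tag >> 2
    if length_minus_one >= 60:
        raise ValueError("length stored in following bytes; not handled in this sketch")
    return length_minus_one + 1, pos + 1

declared_size, pos = read_varint(HEAD)
literal_len, pos = read_literal_tag(HEAD, pos)
print(declared_size)            # 1044
print(literal_len)              # 59
print(HEAD[pos])                # 131
print(HEAD[pos + 6:pos + 13])   # b'kp_node'
```

The header declares 1044 uncompressed bytes, which is the size the decoder overruns. The first literal begins with 131, the version byte of the Erlang external term format, followed by a 2-tuple whose first element is the atom kp_node, so the payload appears to be a serialized B-tree key-pointer node.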

On Thu, Apr 18, 2013 at 9:03 AM, Benoit Chesneau <bc...@gmail.com> wrote:
> On Thu, Apr 18, 2013 at 3:05 PM, Robert Newson <rn...@apache.org> wrote:
>> Hi Victor,
>>
>> Thanks for the report and for capturing a .couch file in this state. I
>> have reproduced the error locally with 1.2.0.
>>
>> Can you tell me anything about what was happening before this
>> happened? Did you ever run out of disk space or have other disk
>> issues? What operating system? what version of erlang? which
>> filesystem? which mount options?
>>
> also what was the load at that point (CPU and such). Do you have any
> other error in logs before that?
>
> - benoit

Re: Corrupted database example file

Posted by Benoit Chesneau <bc...@gmail.com>.

On Thu, Apr 18, 2013 at 3:05 PM, Robert Newson <rn...@apache.org> wrote:
> Hi Victor,
>
> Thanks for the report and for capturing a .couch file in this state. I
> have reproduced the error locally with 1.2.0.
>
> Can you tell me anything about what was happening before this
> happened? Did you ever run out of disk space or have other disk
> issues? What operating system? what version of erlang? which
> filesystem? which mount options?
>
Also, what was the load at that point (CPU and such)? Do you have any
other errors in the logs before that?

- benoit

Re: Corrupted database example file

Posted by Robert Newson <rn...@apache.org>.

Hi Victor,

Thanks for the report and for capturing a .couch file in this state. I
have reproduced the error locally with 1.2.0.

Can you tell me anything about what was happening before this
happened? Did you ever run out of disk space or have other disk
issues? Which operating system? Which version of Erlang? Which
filesystem? Which mount options?

B.

On 18 April 2013 13:17, Victor Nicollet <vn...@runorg.com> wrote:
> Hello,
>
> The @CouchDB twitter account thought you might find this information
> helpful.
>
> My SaaS start-up uses CouchDB as its primary database. Lately, I have been
> having database corruption issues with version 1.2.0 : every few weeks, one
> of our databases becomes corrupted, which has several negative consequences
> (among others) :
>
>    - Replication of that database fails (it does not even start).
>    - Compaction of that database fails and *freezes* the server.
>    - Several documents in the database become inaccessible through either
>    direct access or through _all_docs.
>
>  The latest affected database does not contain any information about our
> customers, so I am allowed to release it publicly :
>
> http://nicollet.net/public/2013-04-18.couchdb/prod-folder.couch
>
> This database contains 325 irretrievable documents between identifiers
> 2xFEY0pU2Eb and 3Fn6l04G6Oa.
> I hope this helps,
>
> --
> Victor Nicollet, CTO, www.runorg.com

Re: Corrupted database example file

Posted by Paul Davis <pa...@gmail.com>.

Thanks again for the database to debug. After poking at this for
a while I can't really come up with a solid lead on what might have
happened. I can force decompression to succeed by altering the binary,
but the decompressed binary that comes out is quite obviously
corrupted in a number of spots, which makes me think that the snappy
compressed binary is actually quite messed up. Rather than spend more
time trying to reverse engineer snappy's compression algorithm, I'm
going to wait to see what sort of file system settings you're running
with.

Specifically of interest is whether you're running CouchDB over NFS or
have disabled fsync in the CouchDB configuration. Either of those
could lead to behavior like this. Given the wide-scale usage
of CouchDB and the scarcity of corruption reports, the assumption is
that something in your configuration is breaking the contract for
POSIX APIs getting data onto disk safely.
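
One way to check the fsync part is sketched below (not from the original message). It assumes the CouchDB 1.x convention of an `fsync_options` key in the `[couchdb]` section of the ini files, with `before_header`, `after_header` and `on_file_open` as the usual values. Those names are recalled from the 1.x default configuration and should be verified against the actual install. The sample text and the function name `missing_fsync_options` are hypothetical.

```python
import configparser

# Hypothetical excerpt; a real server would read its own default.ini/local.ini.
SAMPLE_INI = """\
[couchdb]
delayed_commits = true
fsync_options = [before_header, after_header, on_file_open]
"""

# Assumed default set of fsync points for CouchDB 1.x (verify locally).
EXPECTED = {"before_header", "after_header", "on_file_open"}

def missing_fsync_options(ini_text):
    """Return the expected fsync points that the given ini text leaves out.

    If fsync_options is not set at all, the defaults are assumed to apply
    and an empty set is returned.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    raw = parser.get("couchdb", "fsync_options", fallback=None)
    if raw is None:
        return set()
    configured = {item.strip() for item in raw.strip("[] ").split(",") if item.strip()}
    return EXPECTED - configured

print(missing_fsync_options(SAMPLE_INI))
```

An empty result means none of the assumed fsync points has been removed. A non-empty result would point at the kind of configuration change the paragraph above asks about.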

And on the off chance, have you got any idea if this disk is going
bad? Or are you getting corruption on multiple machines?

Let us know what you can. Obviously corruption like this is serious
business for us.

On Thu, Apr 18, 2013 at 7:17 AM, Victor Nicollet <vn...@runorg.com> wrote:
> Hello,
>
> The @CouchDB twitter account thought you might find this information
> helpful.
>
> My SaaS start-up uses CouchDB as its primary database. Lately, I have been
> having database corruption issues with version 1.2.0 : every few weeks, one
> of our databases becomes corrupted, which has several negative consequences
> (among others) :
>
>    - Replication of that database fails (it does not even start).
>    - Compaction of that database fails and *freezes* the server.
>    - Several documents in the database become inaccessible through either
>    direct access or through _all_docs.
>
>  The latest affected database does not contain any information about our
> customers, so I am allowed to release it publicly :
>
> http://nicollet.net/public/2013-04-18.couchdb/prod-folder.couch
>
> This database contains 325 irretrievable documents between identifiers
> 2xFEY0pU2Eb and 3Fn6l04G6Oa.
> I hope this helps,
>
> --
> Victor Nicollet, CTO, www.runorg.com