Posted to dev@couchdb.apache.org by Adam Kocoloski <ko...@apache.org> on 2010/08/31 07:25:39 UTC
what to do about invalid UTF-8 in saved documents?
It turns out that mochijson2 will incorrectly decode an invalid UTF-8 string if the illegal byte sequence in the string occurs after an escaped character (COUCHDB-875). This means that one can store documents which will never be successfully retrieved or indexed in CouchDB 1.0. Moreover, once one of these documents makes it into the DB, a view build on that DB will never complete.
I wonder what we should do to circumvent that problem? At the very least it might make sense for the view indexer to skip documents which contain invalid UTF-8.
Adam
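Editorial sketch, not part of the original message: the snippet below illustrates the kind of input described above (an escaped character followed by an illegal byte) and one way an indexer could skip such documents instead of stalling. It is Python, not the Erlang used by CouchDB, and it does not reproduce the mochijson2 code path. The names is_valid_utf8 and indexable_docs, the logger name, and the (doc_id, raw_body) input shape are all assumptions made for illustration.

```python
import logging

logger = logging.getLogger("indexer_sketch")  # hypothetical logger name


def is_valid_utf8(raw: bytes) -> bool:
    """Return True only if raw decodes as well-formed UTF-8."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def indexable_docs(docs):
    """Yield (doc_id, raw_body) pairs whose body is valid UTF-8.

    docs is a hypothetical iterable of (doc_id, raw_body_bytes) pairs.
    Invalid bodies are skipped with a warning rather than aborting the run.
    """
    for doc_id, raw in docs:
        if is_valid_utf8(raw):
            yield doc_id, raw
        else:
            logger.warning("skipping %s: body is not valid UTF-8", doc_id)


# A string value with an escaped character followed by an illegal byte (0xFF).
bad = b'{"v": "line\\n\xff"}'
# The same shape of value, but with a well-formed multi-byte character.
good = '{"v": "line\\ncaf\u00e9"}'.encode("utf-8")

kept = [doc_id for doc_id, _ in indexable_docs([("good", good), ("bad", bad)])]
```

Checking the raw bytes before any JSON decoding keeps the check independent of how the decoder handles escapes, which is where the reported bug sits.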
Re: what to do about invalid UTF-8 in saved documents?
Posted by Adam Kocoloski <ko...@apache.org>.
Yep, it can also be removed by doing DELETE /dbname/docid?rev=...
I think the workaround patch needs to be at a lower level than the view updater, as I believe replication will also break when it encounters the bad document.
Regards,
Adam
On Sep 1, 2010, at 2:40 PM, Jan Lehnardt wrote:
> Thanks Adam for finding this one. I ran into it a couple of times and thought I was going crazy.
>
> I think the view server should skip the invalid doc and print a warning in the log file with the doc id when it does.
>
> I believe a _bulk_docs request with a _deleted:true member still allows removal of that doc, but I haven't tried in a while.
>
> Cheers
> Jan
Re: what to do about invalid UTF-8 in saved documents?
Posted by Jan Lehnardt <ja...@apache.org>.
Thanks Adam for finding this one. I ran into it a couple of times and thought I was going crazy.
I think the view server should skip the invalid doc and print a warning in the log file with the doc id when it does.
I believe a _bulk_docs request with a _deleted:true member still allows removal of that doc, but I haven't tried in a while.
Cheers
Jan
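Editorial sketch, not part of the original messages: the two removal routes mentioned in this thread (a DELETE on the document with its rev, and a bulk request carrying a _deleted:true member) could be built roughly as follows. The requests are only constructed, not sent. The base URL, the function names, and the database, document id and rev values in the usage lines are placeholders, and whether either route succeeds for a document with invalid UTF-8 is exactly what the thread leaves partly untested.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "http://localhost:5984"  # assumed server address


def delete_request(db: str, doc_id: str, rev: str) -> Request:
    """Build DELETE /db/docid?rev=<rev> without sending it."""
    path = f"/{quote(db, safe='')}/{quote(doc_id, safe='')}"
    return Request(f"{BASE_URL}{path}?{urlencode({'rev': rev})}", method="DELETE")


def bulk_delete_request(db: str, doc_id: str, rev: str) -> Request:
    """Build a bulk-docs POST that marks one document as deleted."""
    body = json.dumps(
        {"docs": [{"_id": doc_id, "_rev": rev, "_deleted": True}]}
    ).encode("utf-8")
    return Request(
        f"{BASE_URL}/{quote(db, safe='')}/_bulk_docs",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values; a real call needs the document's current revision.
single = delete_request("mydb", "doc1", "1-abc")
bulk = bulk_delete_request("mydb", "doc1", "1-abc")
```

Either request could then be passed to urllib.request.urlopen against a running server.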