Posted to dev@couchdb.apache.org by Adam Kocoloski <ko...@apache.org> on 2010/08/31 07:25:39 UTC

what to do about invalid UTF-8 in saved documents?

It turns out that mochijson2 will incorrectly decode an invalid UTF-8 string if the illegal byte sequence in the string occurs after an escaped character (COUCHDB-875).  This means that one can store documents which will never be successfully retrieved or indexed in CouchDB 1.0.  Moreover, once one of these documents makes it into the DB, a view build on that DB will never complete.

I wonder what we should do to circumvent that problem?  At the very least it might make sense for the view indexer to skip documents which contain invalid UTF-8.
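To make the failure mode and the proposed skip concrete, here is a minimal sketch in Python (not CouchDB's Erlang code). It assumes the indexer sees each document as raw JSON bytes; the names is_valid_utf8 and index_documents are hypothetical and exist only for illustration. The "bad" document has an illegal byte (0xFF) right after an escaped character, which is the shape described above.

```python
import json


def is_valid_utf8(raw: bytes) -> bool:
    """Return True only if the bytes decode cleanly under strict UTF-8."""
    try:
        raw.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


def index_documents(raw_docs):
    """Hypothetical indexer loop: skip documents with invalid UTF-8.

    raw_docs is an iterable of (doc_id, raw_json_bytes) pairs.
    Returns the parsed documents plus the ids that were skipped.
    """
    indexed, skipped = [], []
    for doc_id, raw in raw_docs:
        if not is_valid_utf8(raw):
            skipped.append(doc_id)  # a real indexer would also log the id
            continue
        indexed.append((doc_id, json.loads(raw.decode("utf-8"))))
    return indexed, skipped


docs = [
    ("good", b'{"name": "caf\xc3\xa9"}'),
    # backslash-n escape followed by the illegal byte 0xFF
    ("bad", b'{"name": "line\\n\xff"}'),
]
indexed, skipped = index_documents(docs)
```

With this input the "bad" id ends up in skipped and the build carries on with the remaining documents, instead of stalling on the first undecodable one.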

Adam

Re: what to do about invalid UTF-8 in saved documents?

Posted by Adam Kocoloski <ko...@apache.org>.

Yep, it can also be removed by doing DELETE /dbname/docid?rev=...
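For anyone who wants to script that, a rough sketch of building the request in Python follows. The base URL, the function name build_delete_request and the revision value "1-abc" are placeholders I made up for illustration; use the actual current revision of the bad document. The sketch only constructs the request and does not send it.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_delete_request(base_url, db, doc_id, rev):
    """Build (but do not send) DELETE /db/docid?rev=<rev>."""
    path = "{}/{}/{}".format(
        base_url.rstrip("/"),
        quote(db, safe=""),      # percent-encode the database name
        quote(doc_id, safe=""),  # percent-encode the document id
    )
    return Request(path + "?" + urlencode({"rev": rev}), method="DELETE")


# Placeholder values: a local server on the default port and a fake revision.
req = build_delete_request("http://localhost:5984", "dbname", "docid", "1-abc")
```

Passing req to urllib.request.urlopen would actually issue the DELETE.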

I think the workaround patch needs to be at a lower level than the view updater, as I believe replication will also break when it encounters the bad document.

Regards,
Adam

On Sep 1, 2010, at 2:40 PM, Jan Lehnardt wrote:

> Thanks Adam for finding this one. I ran into it a couple of times and I thought I was crazy.
> 
> I think the view server should skip the invalid doc and print a warning in the log file with the doc id when it does.
> 
> I believe a _bulk_docs request with a _deleted:true member still allows removal of that doc, but I haven't tried in a while.
> 
> Cheers
> Jan
> -- 
> 
> 
> On 31 Aug 2010, at 07:25, Adam Kocoloski wrote:
> 
>> It turns out that mochijson2 will incorrectly decode an invalid UTF-8 string if the illegal byte sequence in the string occurs after an escaped character (COUCHDB-875).  This means that one can store documents which will never be successfully retrieved or indexed in CouchDB 1.0.  Moreover, once one of these documents makes it into the DB, a view build on that DB will never complete.
>> 
>> I wonder what we should do to circumvent that problem?  At the very least it might make sense for the view indexer to skip documents which contain invalid UTF-8.
>> 
>> Adam
>> 
>

Re: what to do about invalid UTF-8 in saved documents?

Posted by Jan Lehnardt <ja...@apache.org>.

Thanks Adam for finding this one. I ran into it a couple of times and I thought I was crazy.

I think the view server should skip the invalid doc and print a warning in the log file with the doc id when it does.

I believe a _bulk_docs request with a _deleted:true member still allows removal of that doc, but I haven't tried in a while.
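Roughly, the body of such a request could be assembled like this. This is an untested sketch in Python: the helper name bulk_delete_body and the id and revision values are made up for illustration, and the body would be POSTed to /dbname/_bulk_docs with a JSON content type.

```python
import json


def bulk_delete_body(docs):
    """Build a _bulk_docs JSON body that marks each document as deleted.

    docs is an iterable of (doc_id, rev) pairs; rev should be the
    current revision of the document being removed.
    """
    return json.dumps(
        {
            "docs": [
                {"_id": doc_id, "_rev": rev, "_deleted": True}
                for doc_id, rev in docs
            ]
        }
    )


# Placeholder id and revision, not real values.
body = bulk_delete_body([("docid", "1-abc")])
```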

Cheers
Jan
-- 

On 31 Aug 2010, at 07:25, Adam Kocoloski wrote:

> It turns out that mochijson2 will incorrectly decode an invalid UTF-8 string if the illegal byte sequence in the string occurs after an escaped character (COUCHDB-875).  This means that one can store documents which will never be successfully retrieved or indexed in CouchDB 1.0.  Moreover, once one of these documents makes it into the DB, a view build on that DB will never complete.
> 
> I wonder what we should do to circumvent that problem?  At the very least it might make sense for the view indexer to skip documents which contain invalid UTF-8.
> 
> Adam
>