Posted to user@couchdb.apache.org by Ho-Sheng Hsiao <ho...@isshen.com> on 2008/10/25 09:46:19 UTC

UTF-8 Support?

Hey all,

I'm trying to load the Unihan database into CouchDB (extracted from the
Unicode specification). Parts of it require passing UTF-8 characters,
which I escape to \uxxxx form as the JSON specification allows.
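
For reference, the escaping itself is easy to reproduce. A minimal Python
illustration (not the extraction script I actually used):

import json

# With ensure_ascii=True (the default), everything outside ASCII becomes
# a \uXXXX escape, and code points above U+FFFF come out as UTF-16
# surrogate pairs.
print(json.dumps({"kDefinition": "the original form for \u4e03 U+4E03"}))
# {"kDefinition": "the original form for \u4e03 U+4E03"}
print(json.dumps({"example": "\U00020001"}))
# {"example": "\ud840\udc01"}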

Since the initial load has around 71,000 records, I'm using bulk
uploading via:

curl -X POST http://localhost:5984/unihan/_bulk_docs \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @data/Unihan-5.1.0.json
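
(In case anyone wants to reproduce this: _bulk_docs takes a single JSON
object with a "docs" array, so the file is shaped roughly like the sketch
below, with ~71,000 entries in the array.)

{"docs": [
  {"unihan_version": "5.1.0",
   "unihan": {"kDefinition": "the original form for \u4e03 U+4E03",
              "kMandarin": "QI1"},
   "_id": "U+20001"},
  ...
]}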

However, I would run into this error:

[info] [<0.62.0>] HTTP Error (code 500): {'EXIT',
                           {if_clause,
                               [{xmerl_ucs,char_to_utf8,1},
                                {lists,flatmap,2},
                                {cjson,tokenize,2},
                                {cjson,decode1,2},
                                {cjson,decode_object,3},
                                {cjson,decode_array,3},
                                {cjson,decode_object,3},
                                {cjson,json_decode,2}]}}


This error occurred on a recent trunk version as well as the 0.8.1
tarball (sorry, I don't remember the SVN rev number of the version I
used). I had attempted to use the latest trunk version (r707821), but
since that did not even compile, I couldn't try it.

I don't know which record it is barfing on. Pulling a single record out:

{
  "unihan_version": "5.1.0",
  "unihan": {
    "kIRG_GSource":"HZ",
    "kOtherNumeric":"7",
    "kIRGHanyuDaZidian":"10004.020",
    "kDefinition":"the original form for \u4e03 U+4E03",
    "kCihaiT":"10.601",
    "kPhonetic":"1635",
    "kMandarin":"QI1",
    "kCantonese":"cat1",
    "kRSKangXi":"1.1",
    "kHanYu":"10004.020",
    "kRSUnicode":"1.1",
    "kIRGKangXi":"0076.021"},
    "_id":"U+20001"
  }
}

Seems to work fine even with the bulk uploader.

I'm going to attempt to insert the records one by one. Maybe I can find
out which record it is barfing on, or maybe the JSON was invalid. It
seems to me, though, that something is barfing on UTF-8 in bulk uploads
over a certain size.
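
Roughly what I have in mind is the following Python sketch (not the actual
extraction script; it assumes the file is already in the {"docs": [...]}
shape that _bulk_docs expects):

import json
import urllib.error
import urllib.request

# Re-post each record in its own _bulk_docs call so a failing record can
# be identified from the error it triggers.
with open("data/Unihan-5.1.0.json") as f:
    docs = json.load(f)["docs"]

for doc in docs:
    body = json.dumps({"docs": [doc]}).encode("utf-8")
    req = urllib.request.Request(
        "http://localhost:5984/unihan/_bulk_docs", data=body,
        headers={"Content-Type": "application/json"})
    try:
        urllib.request.urlopen(req)
    except urllib.error.HTTPError as e:
        print(doc["_id"], e.code, e.read())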

If someone wants to try it out, I can supply the json file I used. Any
help is appreciated.

-- 
Ho-Sheng Hsiao, VP of Engineering
Isshen Solutions, Inc.
(334) 559-9153

Re: UTF-8 Support?

Posted by Jan Lehnardt <ja...@apache.org>.
On Oct 25, 2008, at 09:46, Ho-Sheng Hsiao wrote:

> This error occurred on a recent trunk version as well as the 0.8.1
> tarball (sorry, I don't remember the SVN rev number of the version I
> used). I had attempted to use the latest trunk version (r707821), but
> since that did not even compile, I couldn't try it.

Can you try building trunk (r707851 or later) again and paste any build
errors you are seeing?

Cheers
Jan
--



Re: UTF-8 Support?

Posted by Ho-Sheng Hsiao <ho...@isshen.com>.
Chris Anderson wrote:
> If you don't mind, I'll take a look at it. The error you showed sure
> looks like a utf8 error, but with such a big bulk upload it's hard to
> be sure.
>
> Perhaps you can put the Unihan-5.1.0.json file online somewhere, or if
> you have it boiled down to records that are causing the problem,
> singling those out would of course be helpful.

http://windgate.isshen.net/~hhh/couchdb/Unihan-5.1.0.json.gz
http://windgate.isshen.net/~hhh/couchdb/loading.log.gz

In the meantime, I may have found what was causing the utf8 error, and
have found a different error being thrown.

I modified the extraction script so that it does a bulk upload with a
single record at a time. There were 9 errors of this type. When I took a
look at three of the records, it seemed pretty obvious:

{"unihan_version":"5.1.0",
  "unihan":{
    "kSemanticVariant":"U+51F9<kLau",
    "kIRG_GSource":"KX",
    "kLau":"2272",
    "kIRGHanyuDaZidian":"10099.060",
    "kDefinition":"(Cant.) \u9152\ud841\udd44, a dimple",
    "kCantonese":"nap1",
    "kRSKangXi":"13.3",
    "kCheungBauer":"013\/05;;nap1",
    "kHanYu":"10099.060",
    "kCowles":"2861",
    "kIRG_TSource":"5-2152",
    "kRSUnicode":"13.3",
    "kMeyerWempe":"1968",
    "kIRGKangXi":"0129.050",
    "kCheungBauerIndex":"341.08"},
  "_id":"U+20544"
}

{"unihan_version":"5.1.0",
  "unihan":{
    "kVietnamese":"b\u1ea3u",
    "kDefinition":"(Cant.) \u751f\ud843\ude12\u4eba, a stranger",
    "kCantonese":"bou2",
    "kRSKangXi":"30.9",
    "kCheungBauer":"030\/09;;bou2",
    "kIRG_VSource":"0-3237",
    "kRSUnicode":"30.9",
    "kIRGKangXi":"0201.121",
    "kCheungBauerIndex":"365.10"},
  "_id":"U+20E12"
}

{"unihan_version":"5.1.0",
  "unihan":{
    "kSemanticVariant":"U+22E23",
    "kIRG_GSource":"KX",
    "kVietnamese":"n\u00edu",
    "kIRGHanyuDaZidian":"31971.020",
    "kDefinition":"(same as U+22E23 \ud84b\ude23) to select, pick",
    "kMandarin":"NIAO3",
    "kRSKangXi":"64.13",
    "kHanYu":"31971.020",
    "kIRG_TSource":"4-5048",
    "kRSUnicode":"64.13",
    "kIRGKangXi":"0458.310"},
  "_id":"U+22D91"
}

What it looks like is that it is barfing on
\u9152\ud841\udd44
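
For what it's worth, \ud841\udd44 is a UTF-16 surrogate pair: in JSON it
encodes the single code point U+20544, i.e. the very character this record
describes. A quick Python check:

import json

high, low = 0xD841, 0xDD44
code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
print(hex(code_point))                      # 0x20544
print(len(json.loads('"\\ud841\\udd44"')))  # 1 -- one character after decoding

My guess (just from the stack trace) is that the decoder hands the two
surrogate halves to xmerl_ucs:char_to_utf8 one at a time, and a lone
surrogate is not a valid character to encode, hence the if_clause.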

The other errors I was getting were weirder. I tried matching the error
output with the record by verifying that it made it into the database,
but there may be other records that did not report an error, yet for
which CouchDB returned a 404 when I tried querying them. What I'll do is
write a check script and have it run through all the records, validating
that the data matches the source.
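
The check script I have in mind is roughly this (a Python sketch, again
assuming the source file is in {"docs": [...]} form):

import json
import urllib.error
import urllib.parse
import urllib.request

with open("data/Unihan-5.1.0.json") as f:
    source = {d["_id"]: d for d in json.load(f)["docs"]}

for doc_id, expected in source.items():
    url = "http://localhost:5984/unihan/" + urllib.parse.quote(doc_id)
    try:
        stored = json.load(urllib.request.urlopen(url))
    except urllib.error.HTTPError as e:
        print("missing:", doc_id, e.code)
        continue
    stored.pop("_rev", None)  # ignore the revision CouchDB adds
    if stored != expected:
        print("mismatch:", doc_id)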

Here's a few of the other errors I was getting:

{"ok":true,"new_revs":[{"id":"U+36B4","rev":"1465697479"}]}

{"error":"EXIT","reason":"{function_clause,[{cjson,tokenize_string,\n
                      [[],\n
{decoder,unicode,null,1,144,any},\n
[115,101,110,111,32,102,111,32,101,102,105,119,41,\n
       22994,32,115,97,32,101,109,97,115,40]]},\n
{cjson,tokenize,2},\n                  {cjson,decode1,2},\n
     {cjson,decode_object,3},\n
{cjson,decode_array,3},\n                  {cjson,decode_object,3},\n
               {cjson,json_decode,2},\n
{couch_httpd,handle_db_request,3}]}"}
{"error":"EXIT","reason":"{function_clause,[{cjson,tokenize_string,\n
                      [[],\n
{decoder,unicode,null,1,205,any},\n
[115,101,110,111,32,44,97,109,100,110,97,114,103,32,\n
         59,110,101,109,111,119,32,114,111,102,32,116,99,\n
              101,112,115,101,114,32,102,111,32,109,114,101,116,\n
                     32,97,32,59,107,108,105,109,32,59,110,97,109,111,\n

119,32,97,32,102,111,32,115,116,115,97,101,114,98,\n
       32,101,104,116,32,41,23341,32,115,97,32,101,109,97,\n
               115,40]]},\n                  {cjson,tokenize,2},\n
            {cjson,decode1,2},\n
{cjson,decode_object,3},\n                  {cjson,decode_array,3},\n
               {cjson,decode_object,3},\n
{cjson,json_decode,2},\n
{couch_httpd,handle_db_request,3}]}"}

{"ok":true,"new_revs":[{"id":"U+36B9","rev":"3226496426"}]}

Records U+36B5 - U+36B8 were not loaded in. Weirdly enough, I think it
is barfing on these two records:


{"unihan_version":"5.1.0",
  "unihan":{
    "kIRG_GSource":"KX",
    "kIRGHanyuDaZidian":"21037.080",
    "kDefinition":"(same as \u59d2)wife of one's husband's elder
brother; (in ancient China) the elder of twins; a Chinese family name,
(same as \u59ec) a handsome girl; a charming girl; a concubine; a
Chinese family name",
    "kMandarin":"SI4",
    "kCantonese":"ci5",
    "kTotalStrokes":"8",
    "kHanYu":"21037.080",
    "kCangjie":"VRLR",
    "kIRG_TSource":"3-2843",
    "kRSUnicode":"38.5",
    "kIRGKangXi":"0258.100"},
  "_id":"U+36B6"
},

{"unihan_version":"5.1.0",
  "unihan":{
    "kIRG_GSource":"KX",
    "kIRGHanyuDaZidian":"21039.040",
    "kDefinition":"(same as \u5b2d) the breasts of a woman; milk; a term
of respect for women; grandma, one's elder sister or sisters, used for a
girl's name","kCihaiT":"383.207","kMandarin":"ER3 NAI3",
    "kCantonese":"nai5",
    "kSBGY":"270.50",
    "kKPS1":"3CFA",
    "kIRG_KPSource":
    "KP1-3CFA",
    "kTotalStrokes":"8",
    "kHanYu":"21039.040",
    "kCangjie":"VOF",
    "kIRG_TSource":"3-2847",
    "kRSUnicode":"38.5",
    "kIRGKangXi":"0258.120"},
  "_id":"U+36B7"
}

Where you have \u59d2) and \u5b2d) ... but why would that affect the
other two records?
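
One thing that at least confirms where the two errors came from: the
integer list in the tokenize_string frame looks like the string cjson has
accumulated so far, stored in reverse (that reading is only my guess from
the output, but decoding it in Python lines up with the records above):

acc = [115, 101, 110, 111, 32, 102, 111, 32, 101, 102, 105, 119, 41,
       22994, 32, 115, 97, 32, 101, 109, 97, 115, 40]
print("".join(chr(c) for c in reversed(acc)))
# prints "(same as X)wife of ones" where X is the literal U+59D2
# character -- the start of U+36B6's kDefinition. The second error's
# list contains 23341 (U+5B2D), matching U+36B7.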

As I said, I'll write a checking script and validate that all the info
is there. Since it will run for a while, I'll give it a shot after the
first utf8 error gets fixed -- who knows, the first error type might
have something to do with the second error type.

Thanks for your help.


Ho-Sheng Hsiao, VP of Engineering
Isshen Solutions, Inc.
(334) 559-9153
http://www.isshen.com

Re: UTF-8 Support?

Posted by Chris Anderson <jc...@apache.org>.
On Sat, Oct 25, 2008 at 12:46 AM, Ho-Sheng Hsiao <ho...@isshen.com> wrote:
>
> I don't know which record it is barfing on. Pulling a single record out:
>
> {
>  "unihan_version": "5.1.0",
>  "unihan": {
>    "kIRG_GSource":"HZ",
>    "kOtherNumeric":"7",
>    "kIRGHanyuDaZidian":"10004.020",
>    "kDefinition":"the original form for \u4e03 U+4E03",
>    "kCihaiT":"10.601",
>    "kPhonetic":"1635",
>    "kMandarin":"QI1",
>    "kCantonese":"cat1",
>    "kRSKangXi":"1.1",
>    "kHanYu":"10004.020",
>    "kRSUnicode":"1.1",
>    "kIRGKangXi":"0076.021"},
>    "_id":"U+20001"
>  }
> }
>
> Seems to work fine even with the bulk uploader.
>
> I'm going to attempt to insert the records one by one. Maybe I can find
> out which record it is barfing on, maybe the json was invalid. It seems
> to me though, that something is barfing on utf8 on bulk uploads over a
> certain limit.
>
> If someone wants to try it out, I can supply the json file I used. Any
> help is appreciated.

If you don't mind, I'll take a look at it. The error you showed sure
looks like a utf8 error, but with such a big bulk upload it's hard to
be sure.

Perhaps you can put the Unihan-5.1.0.json file online somewhere, or if
you have it boiled down to records that are causing the problem,
singling those out would of course be helpful.

Thanks,
Chris

-- 
Chris Anderson
http://jchris.mfdz.com